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Abstract 

The  resolutions  of  interests  in  atmospheric  simulations  require  prohibitively  large  computational  resources. 
Adaptive  mesh  refinement  (AMR)  tries  to  mitigate  this  problem  by  putting  high  resolution  in  crucial  areas  of 
the  domain.  We  investigate  the  performance  of  a  tree-based  AMR  algorithm  for  the  high  order  discontinuous 
Galerkin  method  on  quadrilateral  grids  with  non-conforming  elements.  We  perform  a  detailed  analysis  of 
the  cost  of  AMR  by  comparing  this  to  uniform  reference  simulations  of  two  standard  atmospheric  test 
cases:  density  current  and  rising  thermal  bubble.  The  analysis  shows  up  to  15  times  speed-up  of  the  AMR 
simulations  with  the  cost  of  mesh  adaptation  below  1%  of  the  total  runtime.  We  pay  particular  attention  to 
the  implicit-explicit  (IMEX)  time  integration  methods  and  show  that  the  ARK2  method  is  more  robust  with 
respect  to  dynamically  adapting  meshes  than  BDF2.  Preliminary  analysis  of  preconditioning  reveals  that  it 
can  be  an  important  factor  in  the  AMR  overhead.  The  compiler  optimizations  provide  significant  runtime 
reduction  and  positively  affect  the  effectiveness  of  AMR  allowing  for  speed-ups  greater  than  it  would  follow 
from  the  simple  performance  model. 
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1.  Introduction 

Atmospheric  flows  are  characterized  by  a  vast  spectrum  of  spatial  and  temporal  scales,  from  weather 
fronts  and  planetary  waves  covering  thousands  of  kilometers  and  lasting  weeks,  to  turbulent  motions  at 
the  micro  scale.  Due  to  limitations  in  computational  resources,  we  are  not  able  to  resolve  all  phenomena. 
Most  models  assume  a  uniform  mesh  and  therefore  distribute  computational  resources  uniformly  across  the 
domain.  The  scales  of  motion  in  the  atmosphere  are  not,  however,  distributed  uniformly  both  in  space  and 
time.  The  goal  of  adaptive  mesh  refinement  (AMR)  is  to  focus  the  resolution  of  the  mesh  (and  therefore 
computational  resources)  where  it  is  most  required.  Dynamic  adaptation  aims  to  follow  the  important 
structures  of  the  flow  and  modify  the  mesh  as  the  simulation  progresses,  according  to  some  refinement 
criterion.  An  example  of  such  a  situation  in  the  atmosphere  is  a  hurricane,  which  is  an  event  of  significance 
that  is  relatively  localized  within  a  global  domain  but  traverses  vast  distances.  Static  adaptation,  on  the 
other  hand,  aims  to  refine  the  mesh  once  at  the  beginning  of  the  simulation,  which  allows  to  focus  the 
computational  resources  at  a  particular  area  of  interest  in  the  domain.  In  such  a  way  one  could  well  resolve 
a  certain  part  of  the  globe  for  which  the  weather  forecast  is  to  be  performed,  leaving  the  rest  of  the  domain 
at  much  coarser  resolution.  In  this  paper  we  focus  on  dynamic  mesh  refinement,  which  we  present  on  a 
couple  of  atmospheric  test  cases.  All  the  methods  that  we  discuss,  however,  are  readily  applicable  to  static 
adaptation. 
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In  order  to  discretize  the  equations  we  use  the  discontinuous  Galerkin  (DG)  method  on  quadrilateral 
element  grids.  The  DG  method  has  gained  a  significant  level  of  interest  in  recent  years  and  there  has  been 
a  number  of  efforts  to  apply  it  to  hydrostatic  [1]  and  nonhydrostatic  [2]  atmospheric  flows.  The  method 
benefits  from  great  data  locality  and  high  computation  intensity,  which  helps  to  scale  exceptionally  well  on 
a  large  number  of  processors.  It  also  supports  arbitrary  high  order,  which,  among  other  benefits,  allows  for 
accurate  handling  of  the  non-conforming  edge  fluxes,  which  can  arise  in  the  process  of  mesh  adaptation. 
AMR  for  DG  was  successfully  applied  in  different  applications  (e.g.  shock  capturing  [3,  4],  mantle  convection 
[5]).  Here  we  focus  on  atmospheric  flows,  particularly  the  dry  dynamics  governed  by  the  Euler  equations. 

The  question  we  would  like  to  answer  is:  how  does  AMR  benefit  a  simulation?  To  try  to  answer  this 
question  we  perform  a  detailed  analysis  of  the  cost  of  AMR  comparing  this  to  the  results  of  a  uniformly 
refined  simulation.  To  our  knowledge,  such  a  detailed  study  has  not  been  performed  previously.  Moreover, 
we  seek  to  answer  this  question  in  light  of  the  use  of  implicit-explicit  (IMEX)  time-integrators  because  for 
real-world  applications  (such  as  nonlrydrostatic  numerical  weather  prediction)  explicit  time-integration  is 
not  feasible  due  to  the  small  time-step  restriction  of  the  fast  acoustic  waves. 

The  following  subsections  present  a  brief  overview  of  the  work  done  on  AMR  for  atmospheric  simulation 
in  general  and  specifically  in  conjunction  with  the  DG  method. 

1.1.  AMR  in  atmospheric  simulations 

Jablonowski  [6]  and  Behrens  [7]  give  a  good  overview  of  the  state  of  adaptive  atmospheric  modeling. 
The  hurricane  modeling  community  was  the  first  to  use  nested  grid  techniques  for  their  simulations.  In  this 
approach  fine  scale  grids  were  nested  into  large-scale  coarse  meshes  and  communication  was  allowed  from 
large  to  small  scales  [8,  9]  or  in  both  directions  [10].  The  nested  grid  can  move  with  the  feature  it  tracks,  but 
often  some  previous  knowledge  of  the  grid  movement  is  required  and  the  number  of  points  remains  constant 
throughout  the  simulation. 

Another  example  of  a  mesh  modification  technique  which  preserves  the  number  of  points  and  grid  con¬ 
nectivity  throughout  the  simulation  is  mesh  stretching,  i.e.,  changing  the  grid  spacing  by  the  use  of  trans¬ 
formation  functions  (also  known  as  r-refinement).  Examples  of  such  an  approach  are  presented  in  [11,  12] 
and  more  recently  in  [13]. 

This  paper  focuses  on  an  adaptive  method  which  does  not  involve  a  moving  mesh  but  rather  refines  the 
elements  dynamically  in  regions  of  particular  interest.  This  technique  is  typically  referred  to  as  dynamic 
AMR.  The  first  atmospheric  models  involving  AMR  were  developed  by  Skamarock  and  Klemp  [14,  15]. 
They  used  the  technique  of  Berger  and  Oliger  [16]  where  a  finite  difference  method  is  used  to  integrate  the 
dynamical  equations  first  on  a  coarse  and  then  on  the  finer  grids.  In  order  to  determine  the  location  of  finer 
grids  a  criterion  based  on  truncation  error  estimation  was  used.  The  technique  of  Berger  and  Oliger  [16] 
and  Berger  and  Collela  [17]  is  referred  to  as  block-structured  AMR.  It  was  later  used  in  a  number  of  studies 
including  LeVeque  [18]  and  Nikiforakis  [19]. 

The  only  operational  weather  model  which  uses  dynamic  AMR  is  OMEGA  [20].  OMEGA  is  based  on 
the  finite  volume  MPDATA  scheme,  originally  developed  by  Smolarkiewicz  [21],  which  uses  unstructured 
triangular  meshes.  The  dynamic  adaptation  capabilities  were  implemented  to  the  MPDATA  model  by  Iselin 
and  coworkers  [22],  Another  example  of  AMR  which  uses  triangular  elements  is  the  work  of  Giraldo  [23] 
who  used  the  Lagrange-Galerkin  method.  In  all  of  these  approaches  the  mesh  is  obtained  by  a  Delaunay 
triangulation  of  the  domain  given  some  mesh  size  criteria.  When  the  criterion  indicates  that  the  mesh  should 
be  adjusted,  the  triangulation  is  performed  on  the  entire  domain,  and  the  solution  is  projected  from  the  old 
mesh  to  the  new  one. 

A  different  approach  is  presented  in  the  work  of  Behrens  [24,  25],  where  the  triangular  elements  indicated 
for  refinement  are  subdivided  making  sure  the  conformity  of  the  edges  is  preserved.  The  method  allows  for 
local  refinement  without  modifying  the  entire  domain.  It  is  well  suited  for  triangular  meshes,  however  the 
same  approach  for  quadrilateral  grids  would  be  difficult  as  maintaining  conformity  of  locally  dynamically 
refined  quadrilaterals  is  more  challenging  to  achieve.  An  example  of  static  conforming  quadrilateral  mesh 
refinement  can  be  found  in  [26]. 

Quadrilateral  meshes  yield  to  local  refinement  easily,  provided  that  we  allow  non-conforming  elements 
in  the  grid.  Examples  of  quadrilateral-based  dynamic  AMR  methods  for  the  shallow  water  equations  can 
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be  found  in  the  work  of  Jablonowski  [6].  St-Cyr  et  al.  [27]  compare  an  adaptive  cubed-sphere  grid  spectral 
element  shallow  water  model  with  an  adaptive  finite  volume  method  by  [6]  to  investigate  the  applicability 
of  tree-based  AMR  algorithms  to  atmospheric  models. 

1.2.  Element  based  Galerkin  methods  for  atmospheric  AMR 

The  high  order  element  based  Galerkin  methods  for  AMR  in  atmospheric  applications  is  a  fairly  new  field 
of  study.  These  methods  present  a  new  set  of  challenges  and  possibilities  for  adaptive  mesh  refinement.  By 
expanding  the  solution  in  a  basis  of  high  order  polynomials  in  each  element,  one  can  dynamically  adjust  the 
order  of  these  basis  functions,  which  can  differ  across  elements.  This  kind  of  approach  is  called  p-refinement, 
and  an  example  of  such  a  technique  applied  to  the  shallow  water  equations  can  be  found  in  [28] . 

In  the  previous  section  we  already  mentioned  the  element  mesh  refinement  (so  called  ^.-refinement), 
which  focuses  on  refining  the  mesh  while  keeping  the  polynomial  order  constant  across  the  elements.  If  we 
choose  to  allow  non-conforming  elements,  the  challenge  in  this  approach  is  the  appropriate  treatment  of  the 
non-confoming  faces.  For  the  DG  method  one  needs  to  compute  the  flux  across  the  non-conforming  faces. 
The  mathematical  solution  to  this  problem  was  proposed  by  Kopriva  [29]  as  well  as  Maday  et  al.  [30]  who 
formulated  the  mortar  method  for  non-conforming  elements.  Examples  of  application  of  /i-refinement  to 
atmospheric  flows  can  be  found  in  [31],  where  the  spectral  element  method  for  geophysical  and  astrophysical 
applications  was  used,  or  [27]  where  the  comparison  of  a  finite  volume  and  spectral  element  AMR  code  for 
the  shallow  water  equations  was  performed.  A  recent  study  by  Muller  et  al.  [32]  investigates  the  dynamic 
adaptation  of  triangular  conforming  meshes  and  addresses  the  question  whether  coarsening  the  mesh  in 
certain  areas  of  the  grid  affects  the  solution  in  a  significant  way.  Brdar  et  al.  [33]  perform  the  comparison 
of  two  dynamical  cores  for  numerical  weather  prediction  and  mention  that  the  DUNE  code,  which  uses  the 
DG  method,  has  AMR  capabilities,  however  no  adaptive  mesh  examples  are  discussed  in  that  paper. 

The  p  and  h  refinement  methods  can  be  combined  together.  [34]  shows  the  application  of  such  an 
algorithm  for  the  DG  method  using  triangular,  non-conforming  elements  applied  to  shallow  water  equa¬ 
tions.  Another  example  for  this  set  of  equations  is  presented  in  [35],  where  an  /ip-adaptive  DG  method  on 
quadrilateral,  non-conforming  elements  for  global  tsunami  simulations  is  considered. 

In  this  paper  we  focus  on  the  quadrilateral,  non-conforming  DG  method  for  the  Euler  equations  and 
provide  an  in-depth  analysis  of  the  performance  of  our  implementation.  To  our  knowledge  this  is  the  first 
work  on  tree-based  non-conforming  AMR  for  the  nonhydrostatic  atmospheric  equations.  Furthermore,  to 
our  knowledge  no  previous  work  has  been  published  on  non-conforming  AMR  for  high-order  DG  methods 
for  these  equations. 

This  paper  is  organized  as  follows:  Sec.  2  gives  a  brief  overview  of  the  equations  we  are  solving,  Sec. 
3  provides  an  outline  of  the  DG  method,  Sec.  4  discusses  the  difference  between  conforming  and  non- 
conforming  mesh.  In  Sec.  5  we  describe  the  details  of  the  mesh  adaptation  algorithm.  Finally,  in  Sec.  7 
we  provide  the  outline  of  the  test  cases  followed  by  the  discussion  of  the  results  in  Sec.  8.  The  paper  is 
concluded  in  Sec.  9  and  supplemented  with  an  Appendix,  which  describes  in  detail  the  formulation  of  the 
projection  method. 

2.  Governing  equations 

Non-liydrostatic  atmospheric  dynamical  processes  in  NUMA1  are  governed  by  the  compressible  Euler 
equations  in  conservative  form  which  uses  density  p,  momentum  pu  and  density  potential  temperature  p8 
as  state  variables  (see,  e.g.,  [36]  for  other  forms).  We  use  the  following  equation  set: 

f +  v.Ou)  =  o, 

+  V  •  (pu  ®  u  +  PI)  =  -ppk  +  V  •  (ppVu),  (1) 

^+V-(pf?u)=V-(ppVd), 


1NUMA  is  the  name  of  our  model  and  is  an  acronym  for  the  Nonhydrostatic  Unified  Model  of  the  Atmosphere 
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where  u  =  ( u,w)T  is  the  velocity  field,  V  =  £z)T  is  the  gradient  operator,  <8»  is  the  tensor  product,  I 
is  the  rank-2  identity  matrix,  k  =  (0, 1)T  is  the  directional  vector  that  points  along  the  z  direction,  g  is  the 
gravitational  acceleration  and  P  is  the  pressure  obtained  from  the  equation  of  state 


P  =  Po 


(2) 


The  dynamical  viscosity  p  is  varied  among  the  test  cases.  Note  that  while  this  is  not  the  mathematically 
proper  form  of  the  true  Navier-Stokes  viscous  stresses,  it  is  sufficient  for  the  chosen  test  cases  as  shown 
in  Giraldo  and  Restelli  [36].  Additional  terms  requiring  definition  are  the  pressure  at  the  lower  boundary 
Pq  =  1  x  105  Pa,  the  gas  constant  R  =  cp  —  cv  and  the  specific  heats  at  constant  pressure  and  volume,  cp 
and  cv. 


3.  Discontinuous  Galerkin  method 


Giraldo  and  Restelli  [36]  describe  in  detail  the  discretisation  of  Eq.  (1)  for  the  DG  method.  Here  we 
outline  the  weak  formulation  for  the  sake  of  completeness.  Note  that  for  the  sake  of  brevity  the  analysis  in 
this  paper  was  conducted  using  the  weak  form  only,  although  both  strong  and  weak  forms  are  implemented 
in  the  NUMA  software  with  the  non-conforming  AMR  algorithm. 

To  describe  the  DG  method,  we  write  Eq.  (1)  in  vector  form 

W  +VF  =  S,q). 

where  q  =  (p,  UT,  @)T  is  the  solution  vector  with  U  =  pu  and  0  =  p9.  F(q)  is  the  flux  tensor  given  by 


F(q) 


/  u 

i:f-Pl  v(ppV^) 
\  6U  -  p\7@ 


(3) 


and  the  source  term  S(q)  given  by 


(4) 


Ne 

We  divide  the  computational  domain  into  a  set  of  non-overlapping  elements  fle  such  that  =  (J  Qe. 

e=l 

We  define  the  reference  element  I  =  [ — 1,  l]2  and  for  each  element  f le  there  exists  a  smooth  transformation 
such  that  I  =  Joe(fie).  Additionally,  if  =  Qe(x ,y)  and  I  =  /(£,??)  then  Pqb  :  (x,y)  — >  (G?7)  and 
1  :  (£,  77)  (x,  y).  We  employ  the  notation  x  =  ( x ,  y)  and  ^  =  (G  if).  The  Jacobian  of  this  transformation 

is  given  by  Joe  =  with  determinant  Joe. 

Let  ifk  be  a  basis  function  of  a  space  Pn(I)  of  polynomials  of  degree  N  or  lower  in  I  where  the  index  k 
varies  from  1  to  K  =  (N  +  l)2.  The  tensor  product  structure  of  I  allows  us  to  construct  such  a  basis  as 


V’fe  =  hi(t)hj(T)), 

where  {hi}fL0  is  a  basis  for  Pn{[~  1, 1])  and  index  k  is  uniquely  associated  with  the  pair  (i,j):  k  =  (i  +  1)  + 
(N  +  1  )j.  Let  G  be  the  Legendre-Gauss-Lobatto  (LGL)  points  defined  as  the  roots  of  (1  —  £,2)P'N(C)  =  0, 
where  Pn(0  is  the  Nth  order  Legendre  polynomial.  The  basis  functions  ht(f)  are  in  fact  the  Lagrange 
polynomials  associated  with  LGL  points  G-  Notice  that  associated  with  these  points  are  the  Gaussian 
quadrature  weights 


N(N+1)  \ 
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Wj  = 


Pn(  G) 


(5) 


Let  qjv  be  the  approximation  of  the  solution  vector  q  on  the  element  in  the  expansion  basis 


K 

qivOM)loe  =  {Fnl  (x))  q k{t),  e  =  l ,...,Ne, 

k= 1 


where  we  introduce  the  grid  points  xk  =  Fqb  ( (£i,rjj) )  and  the  grid  point  values  q k(t)  =  q jv  (xfc,t).  The 
computation  of  the  derivatives  of  qvr  gives 


dc In_ 
<9x 


(M) 


ne 


dqN 

dt 


(M) 


Qe 


K 


d^c  Fk  (x))  qfe^’ 


k= 1 


(^!  (x)) 


fc=l 


(6) 

(7) 


Concerning  the  computations  of  integrals,  the  expansion  defined  by  Eq.  (5)  yields 


qjv  CM)  dx 


(8) 


With  the  definitions  of  the  solution  expansion  and  the  operations  of  differentiation  and  integration  in 
place  we  can  now  formulate  a  DG  representation  of  Eq.  (1).  Here  we  consider  a  nodal  formulation  with 
inexact  integration  as  described  in  [36].  We  start  with  multiplying  Eq.  (1)  by  a  test  function  ip  and  integrating 
over  an  element  Qe: 

fa  V’  +  V  •  F(<&))  dCl  =  ip  S(q%)  dne,  (9) 

where  q'y  denotes  the  degrees  of  freedom  collocated  in  fie.  Applying  now  integration  by  parts  and  intro¬ 
ducing  the  numerical  flux  F*,  the  following  problem  is  obtained: 
find  qjv(-,  t)  G  VGG  such  that  VHe,  e  =  1, . . . ,  Ne 


iP 


<9q' 


N 


dt 


dfle  +  /  ipn-F*(qN)dTe-  /  Vip  ■  F(q%)  dfle  =  /  ipS(qeN)dfl , 


(10) 


Mip  G  L2(H)  . 

The  coupling  between  neighboring  elements  is  then  recovered  through  the  numerical  flux  F* ,  which  is 
required  to  be  a  single  valued  function  on  the  interelement  boundaries  and  the  precise  definition  of  which  is 
given  in  [36] . 

By  virtue  of  Eqs.  (6),  (7)  and  (8),  Eq.  (10)  can  be  written  in  the  matrix  form 


dqe 

dt 


Ke)V(«)  ^  (d)T  F{qe)  =  S(qe) 


(11) 


where  m“  =  (Me)_1  Ms’e  and  L Y  =  ( Me)~l  De.  Me,  Ms’e  and  De  are  local  mass, 
differentiation  matrices  given  by 

Mfa  =  whJne  (U  Shk,  D%k  =  whJne  (*„)  V<pk{xh),  M %  =  wshJne  (€/, 


boundary  mass  and 
)Shkn(xh), 


where  h,  k  =  1 , . . .  ,K,  6hk  is  the  Kronecker  delta,  $,k  =  (£$,  iy),  wk  =  wk  =  LOi  for  j  =  0  or  j  =  N, 

wk  =  ojj  for  i  =  0  or  i  =  N  and  Jq  is  the  Jacobian  of  transformation  Fqc  restricted  to  the  boundary  of  I. 
Note  that  this  equation  can  be  simplified  to  yield  the  following  semi-discrete  weak  form 


dqf 

dt 


(12) 


In  Eq.  (10)  we  notice  that  there  is  only  one  integral  (Te)  which  couples  the  neighboring  elements  together. 
It  is  this  boundary  integral  that  needs  to  be  modified  in  order  to  handle  non-conforming  AMR.  This  is 
explained  in  detail  in  Sec.  6. 
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4.  Conforming  vs  non-conforming  mesh 

The  first  question  that  needs  to  be  answered  when  constructing  the  AMR  algorithm  is  whether  non- 
conforming  elements  are  allowed  in  the  grid  -  that  is  whether  each  edge  can  be  shared  by  more  than  two 
elements.  We  can  restrict  ourselves  to  purely  conforming  ones,  where  all  edges  in  the  mesh  are  owned  by 
two  elements  only.  In  the  conforming  case,  the  entire  burden  of  handling  the  changing  mesh  falls  on  the 
mesh  adaptation  algorithm,  which  has  to  make  sure  that  the  grid  remains  conforming  after  adaptation. 
The  upside  of  this  approach  is  an  easy  communication  between  neighboring  elements.  Since  their  edges  are 
conforming,  the  AMR  does  not  introduce  any  additional  complication  to  the  DG  method  solver. 

In  contrast,  in  the  non-conforming  case  the  mesh  adaptation  algorithm  is  kept  simple:  we  divide  each 
element  marked  for  adaptation  into  a  predefined  number  of  children  elements.  In  our  case,  we  choose  to 
divide  a  2D  quadrilateral  element  into  four  children  elements  (from  here  on,  we  shall  refer  to  quadrilaterals 
as  quads).  This  leads  to  a  situation  where,  if  only  one  of  two  neighbor  elements  is  refined,  the  non-refined 
neighbor  shares  an  edge  with  two  children  elements.  This  requires  the  DG  solver  to  be  able  to  compute  the 
numerical  flux  on  such  a  non-conforming  edge.  This  approach  shifts  the  burden  from  the  mesh  adaptation 
algorithm  to  the  solver  side. 

In  this  paper  we  present  the  non-conforming  approach  for  a  2D  quad-based  mesh  for  the  DG  method 
2 .  We  believe  that  the  added  complication  and  increased  cost  related  to  the  computation  of  fluxes  through 
non-conforming  edges  is  more  than  made  up  for  by  the  simplified  element  refinement  algorithm. 

5.  Mesh  adaptation  algorithm 

5.1.  Forest  of  quad- trees 

We  adopt  the  forest  of  quad-trees  approach  proposed  by  [37].  We  generate  an  initial  coarse  mesh,  which 
has  to  represent  the  geometrical  features  of  the  domain.  In  Fig.  la  we  present  a  simple  two  element  initial 
mesh.  We  call  it  the  level  0  mesh,  where  each  element  is  a  root  for  a  tree  of  children  elements.  If  we 
decide  to  refine  element  1,  we  replace  this  element  with  four  children,  which  belong  to  the  level  1  mesh.  We 
represent  it  graphically  on  the  right  panel  of  Fig.  lb.  Active  elements  (a  set  of  elements  which  pave  the 
domain  entirely)  are  marked  in  blue.  Element  number  1  is  now  inactive,  replaced  by  the  four  newly  created 
elements  3,  4,  5  and  6. 

If  we  further  choose  to  refine  element  number  5,  and  thus  introduce  level  2  elements,  we  render  this 
element  inactive  and  replace  it  with  the  four  children  7,  8,  9  and  10.  Our  element  tree  is  presented  in  Fig. 
lc.  Now  elements  1  and  5  are  no  longer  active  while  the  active  elements  2,  3,  4,  6,  7,  8,  9  and  10  form  our 
mesh. 

5.2.  Space  filling  curve 

In  the  previous  subsection’s  examples  we  assigned  numbers  to  the  elements.  Those  numbers  serve  as 
element  unique  labels.  In  order  to  traverse  all  active  elements  in  the  mesh,  we  utilize  the  concept  of  the 
space  filling  curve  (SFC).  To  each  active  element  we  assign  an  index,  which  defines  the  element’s  position 
in  the  space  filling  array.  In  order  to  index  all  the  active  elements  in  the  mesh,  we  search  each  quad-tree  in 
the  forest  for  active  elements  (leaves)  starting  with  the  first  root  element  (element  1).  Since  it  is  inactive, 
we  move  to  its  first  child.  If  the  child  element  is  active,  we  include  it  into  the  space  filling  curve  and  move 
to  the  next  child;  if  it  is  inactive  we  recursively  traverse  the  sub-tree  rooted  in  this  element.  After  finishing 
the  search  of  one  quad-tree,  we  move  to  the  next  level  0  root  element. 

Figure  2  illustrates  the  space  filling  curve  concept.  Note  that  the  same  tree  traversing  technique  can 
yield  different  space  filling  curves,  depending  on  the  numbering  of  children  elements.  The  numbering  that 
produced  the  curve  in  Fig.  2a  imposes  a  row-major  order  of  children  elements,  while  the  curve  in  Fig.  2b 
was  generated  using  counter-clockwise  enumeration.  The  numbering  is  applied  recursively,  starting  from  the 


2It  should  be  noted  that  doing  this  for  the  continuous  Galerkin  method  is  also  possible  but  slightly  more  complicated.  We 
shall  report  on  this  in  a  follow-up  paper. 
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a) 


0 - 0 


2 


Figure  1:  Unstructured  grid  organized  into  a  forest  of  quad-trees. 


level  0  mesh  and  traversing  the  tree  to  its  fullest  depth.  Therefore  in  the  row-major  order  we  first  number 
the  elements  of  the  level  0  mesh  as  (1,2).  Next  we  move  one  level  down  and  enumerate  the  children  of 
element  1  (element  2  has  no  children)  starting  from  the  bottom  left  element  and  enumerating  the  children  in 
the  bottom  row  first,  then  move  to  the  second  row  and  enumerate  the  remaining  two  elements.  We  repeat 
the  recursive  procedure  until  we  enumerate  all  the  elements  on  all  levels. 

In  DG  methods  we  prefer  indexing  the  elements  in  such  a  way  that  adjacent  elements  are  placed  close 
to  each  other  in  the  space  filling  curve.  This  increases  data  locality,  which  in  turn  makes  the  computation 
of  fluxes  more  efficient.  Data  locality  is  particularly  important  in  parallel  implementations  of  the  AMR 
algorithm,  however  a  full  study  of  the  influence  of  element  numbering  on  the  efficiency  of  the  code  exceeds 
the  scope  of  this  paper. 

5.3.  Element  division  technique 

To  keep  the  adaptation  algorithm  simple,  and  the  non-conforming  face  handling  as  efficient  as  possible 
(see  Sec.  6.1),  we  require  that  each  side  of  an  element  to  be  refined  is  divided  in  a  2:1  size  ratio.  This  means 
that  we  split  a  parent  edge  into  two  children  edges  of  equal  length.  In  NUMA  we  use  general  quadrilaterals, 
possibly  with  curved  edges,  which  could  prove  splitting  elements  in  physical  (x,y)  space  difficult.  Therefore 
we  perform  the  element  splitting  in  the  computational  space  (£,17)  instead  (see  Fig.  3a). 

In  Sec.  3  we  defined  the  transformation  T  :  (x,y)  —>  (£,??)•  The  inverse  mapping  J7”1  :  (£,77)  — t  (a :,y)  is 
simply  the  expansion  of  a  variable  in  the  polynomial  basis  if): 

K 

q(xj,yj)  =^9,^(4,%), 

i— 1 

where  q  is  a  variable  defined  in  physical  space,  qi  is  the  nodal  value  of  the  variable  in  computational  space 
at  node  i  corresponding  to  a  basis  function  7/;,; .  0,  rjj)  represent  coordinates  of  the  j-th  nodal  point  in 
computational  space,  which  corresponds  to  a  point  ( Xj,yj )  in  physical  space. 

We  can  treat  the  x  and  y  coordinates  of  the  nodal  points  as  a  variable  across  the  element,  which  yields 
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a) 


Figure  2:  Two  different  variations  of  space  filling  curve. 


Figure  3:  (a)  General  shape  quad  division;  (b)  Projection  of  coordinates  from  parent  to  children  -  (red)  dots  represent  nodal 
values  of  coordinates  of  parent  element,  (green)  crosses  are  the  nodal  coordinates  we  seek  for  one  of  the  four  children  elements. 


the  expansions: 


K 

Xj  =  x(£j,r]j)  = 

i=  1 
K 

Vj  =  (13) 

i—1 


Figure  3b  presents  an  example  of  a  standard  element  with  16  nodal  points  (red  dots).  Each  point  has  a 
nodal  value  of  x ?  and  yf  and  is  characterized  by  coordinates  If  an  element  is  marked  for  refinement, 

we  split  the  standard  element  into  four  children  of  equal  size  (dashed  lines).  Next,  using  the  parent  nodal 
values  of  (a;*,  yi)  we  find  the  nodal  values  of  ( x ,  y)  in  the  children  elements  (green  crosses  in  Fig.  3).  Since  we 
have  (£f ,  r/f )  coordinates  of  parent  nodal  points,  then  the  division  of  the  parent  element  into  four  children 
is  very  simple.  We  can  easily  find  (£,  rf)  coordinates  for  children  nodal  points.  Following  [29]  we  write: 

ef“’  =  *■«." 

=  s ‘  Vi 

where  s  =  1/2  is  a  scale  factor  and  o^'1  and  are  the  offsets  corresponding  to  the  variables  £  and  r/ 
respectively  for  a  child  k.  For  the  lower  left  child  element  (depicted  in  Fig.  3b)  k  =  1,  o'c1  =  1/2  and 
ol)  =1/2.  For  different  children  the  offset  values  will  differ  in  sign. 
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Figure  4:  2:1  balanced  mesh  and  ripple  propagation  problem. 


We  seek  the  values  of  x  and  y  at  child  nodal  points  (green  crosses  -  here  only  one  child  is  represented). 
Knowing  the  value  of  (^c(fc) ;  7yc(fc) )  at  children  nodes  we  can  substitute  it  into  the  expansion  (13)  and  find 
the  coordinates  of  the  children  nodes  in  the  physical  space.  We  write 


c(k) 


K 


E'wmcV). 

i= i 

m  =  EwwtfV), 

3=  1 


which  can  be  represented  in  matrix  form  as 


c(fc)  _  T  c(fc)  P 

xi  ~  L ij  xj ’ 

c(k)  T  c(k )  p 

%  =  Wj  JVj, 


(15) 


(16) 


where  L ^  =  i/jj(^k ,Vik)  is  the  interpolation  matrix  which  holds  values  of  the  parent  basis  functions  tpj  at 
children  node  coordinates  (£ffe,  ?7ffc). 


5-4-  2:1  balance 

A  mesh,  where  each  edge  is  shared  by  at  most  three  elements,  is  called  a  2:1  balanced  mesh,  and  an 
edge  which  is  shared  by  two  elements  on  one  side,  and  one  element  on  the  other,  is  called  2:1  balanced 
edge.  When  using  the  term  ”2:1  balanced”,  we  assume  that  a  1:1  balanced  edge  (where  an  edge  is  shared 
by  only  one  element  on  each  side)  is  also  a  valid  member  of  the  2:1  balanced  mesh.  The  element  refinement 
procedure  described  in  Sec.  5.3  may  lead  to  a  situation,  where  an  element,  which  has  a  2:1  balanced  edge 
and  lies  on  the  refined  side  of  the  edge,  is  marked  for  refinement,  while  its  neighbor  across  this  edge  is  not. 
This  would  violate  the  2:1  balance  causing  the  edge  to  be  shared  by  a  total  of  four  elements  (one  on  one 
side,  and  three  on  the  other). 

An  example  of  such  a  situation  is  shown  in  Fig.  4a.  Elements  3,  4,  7,  8,  9,  10,  6,  2  form  a  2:1  balanced 
mesh.  Let  us  assume  that  element  number  6  was  marked  for  refinement.  This  would  cause  a  conflict,  where 
the  edge  shared  by  elements  2,  4  and  6  is  not  2:1  balanced  anymore.  In  order  to  avoid  this  situation,  a 
special  balancing  procedure  needs  to  be  introduced. 

In  the  situation  presented  in  Fig.  4a,  the  solution  of  the  2:1  balance  problem  is  to  refine  the  element 
2  first,  before  refining  6,  even  though  2  might  not  be  originally  marked  for  refinement.  This  would  not 
lead,  at  any  time  of  this  process,  to  violating  the  2:1  balance  rule.  One  might  imagine,  however,  that  for 
a  more  complex  mesh,  refining  element  number  2  might  also  cause  a  conflict  with  other  edges  owned  by  2. 
Consider  a  situation  depicted  in  Fig.  4b,  where  the  mesh  is  initially  2:1  balanced.  In  this  mesh  there  are 
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two  levels  of  refinement  predefined.  Element  19  is  a  Oth  level  element,  elements  3,  4,  11,  12,  13,  14  are  1st 
level  elements  and  7-10  and  15-18  are  2nd  level.  Let  us  assume  we  want  to  refine  element  number  18,  which 
would  create  3rd  level  children  elements.  This  would  create  a  conflict  with  element  13,  which  is  a  1st  level 
element.  Therefore  we  refine  element  number  13  to  the  2nd  level,  but  this  creates  a  conflict  with  the  0th 
level  element  number  19.  In  order  to  refine  18,  we  need  to  refine  element  19  first,  then  13,  and  finally  18 
in  order  to  keep  2:1  balance  at  all  times.  This  causes  some  regions  of  the  domain  to  be  more  refined  than 
required  from  the  refinement  criterion  -  we  did  not  initially  intend  to  refine  13  or  19.  Such  phenomenon  is 
called  the  ripple  effect ,  where  refinement  of  one  element  can  cause  an  entire  area  not  directly  neighboring 
the  element  in  question  to  be  refined  [38] . 

It  is  easy  to  show  that  in  the  2D  case  the  ripple  propagation  is  limited  by  the  lowest  level  element  in 
the  mesh.  In  a  2:1  balanced  mesh  the  level  difference  between  neighboring  elements  can  be  at  most  1.  The 
conflict  can  occur  only  when  refining  an  n-th  level  element  which  has  a  neighbor  of  level  (n  —  1).  Therefore 
we  need  to  bring  the  (n  —  1)  level  element  to  level  n  before  refining  the  original  element  to  level  (n  +  1).  If 
in  turn  the  (n  —  1)  level  element  causes  conflict  with  (n  —  2)  level  element,  we  need  to  follow  the  balancing 
procedure  recursively.  In  the  worst  case  scenario  we  will  propagate  the  ripple  down  to  Oth  level  element, 
which  by  definition  is  a  root  of  the  element  tree.  Therefore  by  refining  one  n-th  level  element  we  may  be 
forced,  in  the  worst  case,  to  refine  n  other  elements.  This  will  cause  4n  new  elements  to  be  created  in 
the  areas  possibly  not  indicated  by  the  refinement  criterion.  4n  is  typically  a  very  small  number  since  the 
simulations  shown  in  this  work  tend  to  have  values  n  <=  5. 

In  the  case  of  element  coarsening,  we  adopt  a  different  strategy.  If  coarsening  of  an  element  would  cause 
a  conflict  (consider  a  situation  in  Fig.  4b  after  refining  all  indicated  elements,  when  we  want  to  de-refine 
element  19),  we  do  not  perform  this  operation.  In  order  to  keep  the  2:1  balance  we  avoid  propagating  a 
de-refinement  ripple  to  higher  levels.  The  rationale  behind  this  strategy  is  that  it  is  better  to  have  more 
refined  elements  than  we  need,  rather  than  lack  the  resolution  in  the  areas  where  it  is  necessary.  This  way 
we  ensure  that  we  always  have  the  appropriate  level  of  refinement  as  indicated  by  the  refinement  criterion. 

5.5.  Refinement  criterion 

Our  focus  in  this  paper  is  the  AMR  machinery  and  its  particular  application  to  the  DG  method,  therefore 
we  use  a  very  simple  mesh  refinement  criterion.  First,  we  specify  the  quantity  of  interest  (QOI).  It  can  be 
either  a  primitive  variable  like  9 ,  u,  w  or  p,  or  an  expression  derived  from  those  variables  (i.e.  velocity 
magnitude,  absolute  value  of  temperature  fluctuation  etc.).  We  then  choose  a  refinement  threshold.  If  the 
maximum  value  of  the  QOI  within  an  element  falls  below  the  threshold,  the  element  is  marked  for  refinement. 
The  maximum  criterion  can  be  of  course  replaced  by  a  minimum  criterion.  It  is  also  worthwhile  to  consider 
a  gradient,  or  other  derivatives  of  primitive  variables,  as  a  QOI.  Throughout  this  paper  we  use  the  potential 
temperature  perturbation  as  the  quantity  of  interest. 

The  refinement  criterion  typically  need  not  be  evaluated  every  time-step,  but  rather  every  predefined 
number  of  steps,  depending  on  the  particular  problem.  After  each  evaluation  of  the  criterion  for  all  active 
elements  in  the  grid,  the  balancing  algorithm  is  run  to  eliminate  possible  conflicts. 

6.  Handling  of  non-conforming  edges 

The  previous  section  described  the  details  of  the  mesh  refinement  algorithm.  Here  we  will  focus  on  the 
implementation  of  such  an  algorithm  to  a  DG  solver.  Sec.  6.1  and  6.2  describe  the  computation  of  the  flux 
for  the  DG  method. 

6.1.  Projection  onto  2:1  edges 

In  the  DG  method  at  every  time-step  we  need  to  evaluate  the  numerical  flux  through  all  the  element 
edges.  When  allowing  non- conforming  elements  in  the  mesh,  one  needs  to  address  the  problem  of  projecting 
the  data  between  two  sides  of  a  non-conforming  edge.  In  our  case  the  non-conformity  is  limited  to  2:1 
balanced  edges,  which  makes  the  data  exchange  slightly  easier  than  in  a  general  non-conforming  case. 
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Figure  5:  Projection  onto  non-conforming  edges:  a)  scatter  from  left  parent  edge  to  two  children  edges  and  b)  gather  from  two 
children  edges  to  parent  edge. 


Consider  the  situation  shown  in  Fig.  5a.  The  variable  qL  from  a  parent  edge  is  projected  onto  two  children 
edges  and  becomes  qL1  and  qL2.  In  order  to  perform  this  scatter  operation  we  design  two  projection  matrices 
PS1  and  PS2  such  that 


qL1  =  P  slqL 

qL2  =  pS2qL_  (17) 

Similarly  for  the  gather  operation  we  need  the  matrices  PG1  and  PG2  which  satisfy 

qL  =  P  G1qL1  +  P  G2qL2 

The  projection  matrices  are  constructed  using  the  integral  projection  technique  [29],  derived  for  different 
size  ratios  and  different  polynomial  orders  in  neighboring  elements.  In  Appendix  A  we  present  the  outline 
of  the  method  tailored  to  2:1  balanced  edges. 

Due  to  the  fact  that  we  limit  our  non-conforming  edges  to  2:1  ratio  only,  the  scatter  and  gather  projection 
matrices  are  the  same  for  all  edges  and  need  to  be  computed  only  once,  which  makes  the  algorithm  quite 
simple  and  very  efficient. 

6.2.  Flux  computation 

The  flux  computation  algorithm  applied  in  NUMA  to  handle  non-conforming  edges  relies  on  the  projec¬ 
tion  technique  described  in  Sec.  6.1  and  Appendix  A.  First,  we  scatter  the  variables  from  parent  to  children 
edges  using  projection  matrices  PS1  and  PS2.  This  way  we  have  the  necessary  information  on  both  sides 
of  children  edges  just  as  in  the  regular  conforming  case.  We  compute  the  numerical  flux  on  children  edges 
using  the  Rusanov  flux  (see  [36]  for  details)  and  gather  it  back  to  the  parent  edge  using  matrices  PG1  and 
PG2.  Finally,  the  flux  is  applied  on  both  parent  and  children  edges.  Note  that  you  can  replace  the  Rusanov 
flux  with  any  other  Riemann  solver. 

This  algorithm  ensures  that  the  amount  of  numerical  flux  leaving  one  element  is  equal  to  the  flux  received 
by  the  children  elements  on  the  other  side  of  the  non-conforming  edge.  It  is  worth  noting,  however,  that 
since  we  are  dealing  with  the  DG  method,  we  allow  a  discontinuity  between  variables  (and  therefore  flux) 
at  the  children  side  of  the  interface.  The  projection  represents  the  flux  from  both  children  elements  as  one 
smooth  function  defined  on  the  parent  side  of  the  interface.  The  parent  flux  is  not  point-wise  identical  to 
the  children  fluxes,  but  we  constrain  the  integral  of  the  flux  over  the  edge,  which  guarantees  conservation. 

6.3.  2D  projection  between  parent  and  children  elements 

Once  an  element  is  refined  (derefined),  its  data  must  be  projected  onto  its  children  (parent)  elements. 
In  order  to  perforin  these  two  operations  we  will  use  the  2D  version  of  the  integral  projection  technique 
discussed  in  Sec.  6.1.  Figure  6  shows  schematically  the  projections  from  parent  element  with  coordinates 
(£,77)  £  [ — 1,  l]2  to  four  children,  each  with  separate  coordinates  (z^  ,£3  )  £  [ — 1,  l]2,  where  k  =  1,  ...,4 
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Figure  6:  Projection  between  parent  and  children  elements  -  2D  extension  of  integral  projection  for  non-conforming  edges. 

enumerates  the  children  elements.  For  this  projection  we  construct  the  scatter  matrix  P^d-  The  inverse 
operation  is  performed  using  the  gather  matrix  P|d  •  Detailed  construction  of  both  matrices  is  described  in 
Appendix  A.  Note  that  while  the  integral  projection  method  works  well  for  conserved  quantities,  it  may  not 
be  appropriate  for  all  variables  in  the  problem.  It  is  sometimes  better  to  interpolate  or  recompute  certain 
quantities,  if  possible.  An  example  of  such  a  situation  is  the  gravity  direction  vector,  which  is  defined 
completely  by  the  input  to  the  simulation,  therefore  can  be  recomputed  for  each  new  element  in  the  mesh. 
In  both  cases  presented  in  this  paper  the  gravity  direction  was  k  =  (0, 1)  and  in  some  cases  the  projection 
operation  caused  inconsistencies  on  the  order  of  the  round-off  error,  which  in  turn  adversely  affected  the 
solution. 


7.  Test  cases 

In  order  to  test  the  AMR  algorithm  we  run  a  selection  of  cases  from  the  set  presented  in  [36].  The  set 
consists  of  seven  tests  widely  used  for  benchmarking  of  non-hydrostatic  dynamical  cores  of  numerical  weather 
prediction  codes.  For  the  purpose  of  benchmarking  the  AMR  capabilities  of  our  code  we  have  picked  two 
scenarios.  The  density  current  and  rising  thermal  bubble  cases  show  the  performance  of  the  AMR  algorithm 
on  a  rapidly  changing  mesh.  For  both  test  cases  we  compare  the  adaptively  refined  simulation  with  a 
uniformly  refined  simulation.  Both  cases  are  described  in  detail  in  the  aforementioned  paper.  Here  we 
outline  them  for  completeness. 


7.1.  Case  1:  Density  current 

The  case  was  first  published  in  [39]  and  consists  of  a  bubble  of  cold  air  dropped  in  a  neutrally  stratified 
atmosphere.  The  bubble  eventually  hits  the  lower  boundary  of  the  domain  (no  flux  wall)  and  moves  hori¬ 
zontally  shedding  Kelvin-Helmholtz  rotors.  In  order  to  obtain  the  grid-converged  solution  we  apply  artificial 
viscosity  yi  =  75m2/s(see  [39]). 

The  initial  condition  is  defined  in  terms  of  potential  temperature 


6'  = 


for  r  >  rc 
for  r  <  rc  ’ 


(18) 


where  9C  =  — 15K,  r  =  y  (^x xXc^  +  and  rc  =  1.  The  domain  is  defined  as  ( x,z )  G  [0,25600]  x 

[0,6400]  m  with  t  €  [0,900]  s  and  the  center  of  the  bubble  is  at  (xc,zc)  =  (0,3000)  m  with  the  size  of  the 
bubble  defined  by  ( xr,zr )  =  (4000,2000)  m.  The  boundary  conditions  for  all  four  boundaries  are  no-flux 
walls.  The  velocity  field  is  initially  set  to  zero  everywhere. 


12 


(a) 


Figure  7:  Snapshots  of  the  solution  and  dynamically  adaptive  mesh  for  6t  =  1.0  at  different  simulation  times:  (a)  Is,  (b)  300s, 
(c)  600s,  and  (d)  900s. 


7.2.  Case  2:  Rising  thermal  bubble 

In  this  test  case  a  warm  bubble  rises  in  a  constant  potential  temperature  atmosphere  ( 0  =  300A').  As  it 
rises,  it  deforms  until  it  forms  a  mushroom  shape.  Initially,  the  air  is  at  rest  and  in  hydrostatic  balance.  The 
initial  potential  temperature  perturbation  is  given  by  Eq.  (18)  with  6C  =  0.5K  and  rc  =  250m.  The  domain 
has  dimensions  (x,z)  €  [0, 1000]m  x[0, 1000]m  and  the  bubble  is  positioned  at  (xc,zc)  =  (500,350)m.  The 
boundary  conditions  for  all  sides  are  no-flux.  The  simulation  runs  until  t  =  700s. 

8.  Results 

8.1.  Case  1  -  Density  Current 

The  density  current  simulation,  defined  in  Sec.  7.1,  was  initialized  with  a  coarse  mesh  consisting  of  four 
elements  (4x1  grid).  We  chose  polynomial  order  N  =  5  within  each  element.  The  mesh  was  then  refined 
uniformly  to  a  specified  maximum  level  of  refinement.  We  chose  the  maximum  level  to  be  equal  to  5,  which 
allowed  for  a  uniformly  fully  refined  mesh  of  128  x  32  elements.  This  corresponds  to  an  effective  resolution 
of  40  m,  which  is  slightly  below  the  resolution  shown  in  [36]  to  yield  a  converged  solution.  By  effective 
resolution  we  mean  the  average  distance  between  the  nodal  points  within  the  element.  The  initial  condition 
was  generated  for  this  fully  refined  mesh,  which  formed  an  initial  input  to  all  the  Case  1  simulations. 
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Figure  8:  Element  count  as  a  function  of  simulation  time  for  different  Qt  threshold  values. 


(«) 


(&) 


Figure  9:  Snapshots  of  the  solution  and  dynamically  adaptive  mesh  at  T  =  900s  for  two  different  thresholds  :  (a)  0t  =  0.001 
and  (b)  0t  =  4.0. 


In  order  to  obtain  a  reference  solution,  we  run  the  fully  refined  mesh  without  the  AMR  algorithm 
3.  Next,  we  select  four  different  refinement  thresholds  for  the  potential  temperature  perturbation  9t  = 
[0.001,  0.1,  1.0,  2.0,  4.0].  For  each  9t  the  AMR  algorithm  adapts  the  mesh  to  the  initial  condition  and 
continues  modifying  the  grid  every  one  second  of  simulation  time  so  that  the  areas  of  the  domain  where 
8  >  6t  are  fully  refined,  and  the  remaining  elements  are  coarsened  to  a  minimum  resolution  allowed  by  the 
2:1  balance  condition.  Figure  7  shows  snapshots  of  the  mesh  at  different  times  for  9t  =  0.1. 

Figure  8  shows  the  total  number  of  elements  in  the  grid  for  different  values  of  9t  over  the  simulation 
time.  Notice  the  different  behavior  of  the  element  count  for  high  and  low  threshold  values.  The  number 
of  elements  for  high  thresholds  tend  to  level  off  more  quickly  as  the  simulation  progresses.  Low  thresholds 
cause  the  element  count  to  steadily  rise  with  time,  at  least  for  the  time  frame  considered  for  this  test  case. 
This  is  due  to  the  fact  that  the  low  9t  criterion  not  only  requires  the  algorithm  to  refine  the  area  around 
the  moving  structure,  but  also  the  near-wall  wake,  where  the  temperature  is  slightly  decreased  after  the 
passing  of  the  cold  front.  For  further  analysis,  we  define  the  element  ratio  (ER  =  ^=~)  to  be  the  ratio  of 
the  number  of  elements  in  the  reference  simulation  ( Nref  =  4096)  to  the  time  average  number  of  elements 
in  the  AMR  simulations  ( Ne ).  High  ER  corresponds  to  high  threshold  AMR  simulations. 


3We  must  use  a  high-resolution  simulation  as  the  reference  solution  because  this  test  case  has  no  analytic  solution. 
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Table  I:  Front  location  ( L ),  minimum  potential  temperature  perturbation  (0rrrrn)  and  simulation  runtime  (Ta). 


et 

L  [m] 

RK35 

$  min  [K-] 

Ts  [h] 

L  [m] 

BDF2 
@min  [K] 

Ts  [li] 

L  [m] 

ARK2 

@min  [K] 

Ts  [h] 

ref. 

14758.87 

-8.90603 

60.94 

14758.70 

-8.90630 

6.13 

14758.87 

-8.90603 

9.68 

0.001 

14758.79 

-8.90606 

9.78 

14758.71 

-8.90650 

1.08 

14758.79 

-8.90609 

1.57 

0.1 

14758.81 

-8.90607 

7.85 

14758.74 

-8.90661 

0.89 

14758.81 

-8.90610 

1.27 

1.0 

14758.86 

-8.90619 

6.24 

14758.89 

-8.90688 

0.75 

14758.86 

-8.90622 

1.03 

2.0 

14758.89 

-8.90608 

5.53 

14758.98 

-8.90690 

0.71 

14758.89 

-8.90611 

0.92 

4.0 

14758.80 

-8.90555 

4.11 

14759.10 

-8.90648 

0.57 

14758.80 

-8.90558 

0.71 

8.1.1.  Accuracy  Analysis 

In  Fig.  9  two  different  AMR  simulation  results  are  presented.  The  top  picture  shows  the  potential 
temperature  perturbation  field  for  a  low  threshold  ( 6t  =  0.001)  simulation,  while  the  bottom  plot  shows  a 
high  threshold  ( 9t  =  4.0)  result.  The  main  features  of  the  solution  are  the  same  in  both  cases.  Even  though 
in  the  high  threshold  simulation  the  resolved  region  does  not  encompass  the  entire  structure,  the  position 
of  the  front  and  the  rotor  structure  looks  identical  to  the  low  threshold  case.  The  difference  can  be  noted  in 
the  wake  of  the  front  and  the  far  field.  The  high  threshold  mesh  does  not  capture  the  wake  well,  therefore 
this  feature  is  not  represented  in  the  bottom  plot. 

Table  I  confirms  that  all  simulations  reproduce  the  main  features  of  the  solution  well.  The  simulations 
were  run  using  an  explicit  3rd  order  5  step  Runge-Kutta  method  (RK35),  an  IMEX  second-order  backward 
difference  formula  (BDF2)  method  and  an  IMEX  second-order  Additive  Runge-Kutta  method  (ARK2)  -  see 
[40]  for  details  of  these  time-integrators.  The  front  position  was  calculated  as  the  location  of  —  IK  isotherm  at 
the  bottom  wall.  The  values  from  the  computational  grid  were  interpolated  using  the  visualization  package 
Paraview  to  a  very  fine  (0.01m  resolution)  uniform  grid,  and  the  location  of  the  isotherm  was  measured 
on  that  uniform  grid.  The  front  position  error  does  not  exceed  lm  for  all  the  methods.  Interestingly,  the 
front  location  for  RK35  and  ARK2  methods  are  identical.  ARK2  also  closely  follows  RK35  with  regard 
to  the  minimum  potential  temperature  perturbation,  which  suggests  that  ARK2  delivers  superior  accuracy 
to  BDF2.  The  BDF2  results  for  both  the  front  location  and  minimum  potential  temperature  perturbation 
gives  similar,  but  slightly  different  results  than  RK35  and  ARK2. 


Figure  10:  L2  error  norms  for  AMR  simulations  using  different  time  integrators:  blue  line  with  circles  shows  RK35  result; 
green  line  with  squares  shows  BDF2  result;  red  line  with  x  markers  shows  ARK2  result. 


Figure  10  shows  the  L2  normalized  error  norms  for  all  AMR  simulations  plotted  against  the  element  ratio. 
The  norm  was  computed  by  comparing  the  potential  temperature  perturbation  field  of  the  AMR  simulations 
with  a  fully  refined  reference  case  with  the  same  time-integration  method  (AMR  explicit  compared  with 
reference  explicit  -  blue  line  with  circle  markers;  AMR  BDF2  compared  with  reference  BDF2  -  green  line 
with  square  markers;  AMR  ARK2  compared  with  reference  ARK2  -  red  line  with  cross  markers)  using  the 
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following  formula: 


^(Q,  q) 


£(<?*- <£)2 

e,k 

E(^)2 

e,k 


(19) 


where  Q  is  the  reference  solution,  q  is  the  AMR  solution,  e  is  the  index  traversing  the  elements,  and  k 
enumerates  the  nodal  points  within  each  element. 

The  error  for  all  the  time  integration  methods  is  the  same,  which  shows  that  AMR  equally  impacts  the 
simulation  accuracy  regardless  of  the  time  integration  scheme.  For  low  ER  simulations  the  L2  error  is  very 
small  and  stays  below  ICE6.  For  high  ER  the  error  grows  to  10-4.  This  indicates  that  for  low  ER  cases 
the  entire  domain  is  adequately  resolved,  while  for  high  ER  simulations  the  unresolved  far  field  impacts  the 
global  accuracy  of  the  solution. 

Overall,  the  error  analysis  shows  that  AMR  can  deliver  an  accurate  result,  with  the  level  of  accuracy 
dependent  on  the  refinement  threshold.  Even  the  high  threshold  AMR  simulations  can  represent  the  main 
features  of  the  solution  well.  Regarding  IMEX  methods,  ARK2  seems  to  deliver  more  accurate  solutions 
(i.e.  the  solution  which  is  closer  to  the  explicit  RK35). 


8.1.2.  Performance  Analysis 

We  examine  the  performance  of  three  time  integration  methods  in  conjunction  with  an  adaptive  mesh: 
the  explicit  RK35,  IMEX  BDF2,  and  IMEX  ARK2  methods.  While  the  explicit  method  proves  to  be  simpler 
to  analyze,  IMEX  is  the  method  of  choice  for  all  real-world  applications,  because  it  relaxes  the  explicit  time 
step  constraint.  For  explicit  simulations  the  time  step  (At  =  0.01s)  was  ten  times  smaller  than  for  the  IMEX 
simulations.  Figure  lid  reflects  that  difference,  as  the  IMEX  simulations  run  nearly  10  times  faster  than 
their  explicit  counterparts.  The  fastest  method  is  the  BDF2,  while  the  ARK2  is  just  slightly  slower  running 
with  the  same  time  step  (although  the  ARK2  method  can  use  a  larger  time-step  than  BDF2  due  to  a  larger 
stability  region).  The  Courant  number  for  the  IMEX  simulations  was  1.6. 

Since  the  AMR  simulations  have  a  lower  element  count,  we  expect  them  to  run  proportionally  faster 
than  a  fully  refined  reference  simulation.  The  theoretical  ideal  speed-up  is  equal  to  the  ratio  of  the  number 
of  elements  in  the  reference  case  to  an  average  number  of  elements  in  the  AMR  simulations.  We  expect 
that  the  AMR  algorithm  incurs  an  overhead  for  the  evaluation  of  the  refinement  criterion,  projection  of 
the  solution  between  dynamically  changing  meshes  and  computation  of  non-conforming  fluxes,  therefore  the 
actual  speed-up  curve  should  never  exceed  the  ideal  profile.  The  difference  between  the  ideal  and  actual 
speed-up  curves  is  a  measure  of  the  AMR  overhead. 

Figure  11  shows  the  speed-up  of  the  AMR  algorithm  for  different  ER  values  for  (a)  explicit  RK35,  (b) 
IMEX  BDF2  and  (c)  all  three  time  integration  methods.  The  black  solid  line  without  markers  represents  the 
ideal  theoretical  speed-up.  The  speed-up  of  explicit  AMR  simulations  (blue  solid  line  with  circular  markers 
in  Fig.  11a)  is  nearly  ideal,  which  indicates  that  the  cost  of  performing  all  the  operations  accredited  to 
AMR  is  negligible  compared  to  the  cost  of  time  integration.  Since  in  this  case  the  refinement  criterion  was 
evaluated  every  second  (i.e.,  every  100  time-steps)  we  plot  the  speed-up  of  the  simulation  with  refinement 
check  every  time-step  in  the  blue  dashed  line.  The  overhead  due  to  frequent  criterion  evaluation  shows 
on  the  plot,  but  is  still  very  small.  This  also  indicates  that  the  leading  cost  in  the  AMR  overhead  is  the 
evaluation  of  the  criterion,  since  when  we  take  away  the  majority  of  that  cost  (solid  blue  line  with  circular 
markers)  the  speed-up  curve  overlaps  with  the  ideal  line.  The  other  components  of  the  overhead  (mesh 
manipulations,  data  projections,  non-conforming  flux  computations)  have  a  negligible  cost. 

In  order  to  validate  the  choice  of  refinement  frequency,  we  plot  in  Fig.  12  the  time  history  of  the  element 
count  for  two  extreme  refinement  thresholds  using  refinement  every  time-step  (black  lines)  and  every  one 
second  (100  time  steps  for  explicit,  or  10  time  steps  for  IMEX  simulations).  Both  curves  overlap  -  only 
small  differences  are  visible  for  the  high  threshold  simulation  near  time  T  =  680s,  which  can  be  attributed 
to  ’’blinking”,  that  is  refining  and  de-refining  the  same  mesh  location  every  time  step.  In  this  case  the  less 
frequent  refinement  works  to  our  advantage  by  removing  this  unwelcome  feature. 
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Figure  11:  Speed-up  for  (a)  RK35,  (b)  BDF2  and  (c)  all  time-integrators.  Black  solid  line  without  markers  indicates  ideal 
speed-up.  Blue  line  with  circles  marks  the  RK35  results,  green  line  with  squares  marks  BDF2  and  red  line  with  crosses  marks 
ARK2.  (a)  Dashed  line  shows  the  speed-up  in  the  case  where  the  refinement  criterion  is  evaluated  at  every  iteration,  (b) 
The  dashed  line  shows  the  speed-up  with  a  constant  number  of  solver  iterations,  (d)  Wall  clock  time  for  all  time  integration 
methods. 


In  Fig.  lib  the  green  solid  line  with  square  markers  represents  simulations  with  IMEX  BDF2  time 
integration.  Clearly  the  overhead  incurred  by  adaptive  simulations  is  much  greater  than  in  the  explicit 
case.  The  speed-up  curve  has  a  variable  slope,  which  indicates  that  the  overhead  changes  with  the  choice 
of  refinement  threshold.  The  speed-up  of  AMR  simulations  with  the  IMEX  ARK2  method  (red  line  with 
cross  markers  in  Fig.  11c)  is  much  better  than  the  BDF2,  but  with  bigger  overhead  than  the  RK35.  Also 
the  slope  for  ARK2  seems  to  be  less  variable  than  for  BDF2. 

At  the  heart  of  our  IMEX  methods  is  the  GMRES  iterative  solver.  In  Fig.  13  we  show  the  average  number 
of  GMRES  iterations  per  time  step  as  a  function  of  ER  for  both  BDF2  (green  line  with  square  markers)  and 
ARK2  (red  line  with  cross  markers).  The  point  corresponding  to  ER  =  1  is  the  reference  simulation.  For 
BDF2  the  average  iteration  count  grows  with  ER,  while  for  ARK2  it  remains  more  or  less  constant.  This 
can  explain  the  difference  in  the  overhead  incurred  by  AMR  with  those  two  time  integration  methods.  To 
investigate  the  matter  further  we  run  the  BDF2  simulations  with  a  prescribed  number  of  GMRES  iterations. 
The  result  of  this  exercise  is  depicted  by  the  dashed  line  in  Fig.  lib.  The  nearly  ideal  speed-up  is  regained 
by  introducing  a  constant  number  of  GMRES  iterations  for  each  simulation.  Of  course  such  a  constraint 
is  artificial,  as  the  GMRES  algorithm  automatically  determines  the  number  of  iterations  needed  to  satisfy 
the  solution  accuracy.  This  means,  however,  that  it  is  indeed  the  variable  GMRES  iteration  count  that  is 
preventing  the  AMR  BDF2  simulations  from  achieving  a  good  speed-up. 

To  investigate  the  reason  for  the  higher  average  number  of  iterations  per  time  step  in  the  AMR  BDF2 
simulations  we  plot  the  time  history  of  the  iteration  count  for  both  the  reference  and  9t  =  0.001  simulations 
in  Fig.  14a.  The  top  plot  shows  the  GMRES  iterations  for  the  reference  simulation  -  the  number  oscillates 
between  7  and  8  every  time  step.  The  bottom  plot  represents  the  AMR  simulation  and  reveals  frequent 
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Figure  12:  Element  number  time  history  for  two  different  refinement  threshold  settings  (0/  =  0.001  top  line;  0t  =  4.0  bottom 
line)  evaluating  the  criterion  every  time  step  (black)  and  every  100  time  steps  (green)  with  RK35. 


Figure  13:  Average  number  of  GMRES  iteration  per  time  step  for  reference  and  adaptive  simulations. 


spikes  in  the  iteration  count.  A  closer  look  at  the  iteration  count  compared  with  the  changes  in  the  number 
of  elements  in  the  mesh  (Fig.  14b)  shows  that  the  spike  in  the  iteration  count  occurs  whenever  there  is  a 
change  in  the  mesh.  This  clearly  implies  that  the  AMR  algorithm  incurs  an  additional  overhead  with  the 
IMEX  BDF2  method  because  of  an  increased  number  of  GMRES  iterations. 

On  the  other  hand,  such  spikes  do  not  occur  for  ARK2.  The  iteration  count  history  in  Fig.  14c  is  very 
similar  for  both  the  reference  and  9t  =  0.001  simulations,  however  much  more  variable  compared  with  the 
BDF2  reference.  The  reason  for  the  different  behavior  is  that  even  though  both  BDF2  and  ARK2  use  the 
same  iterative  solver,  the  system  solved  by  GMRES  actually  differs  between  the  two  methods.  While  not 
exhaustive,  this  result  does  seem  to  show  that  BDF2  is  less  robust  than  ARK2  with  respect  to  the  projections 
of  the  solution  between  meshes  that  is  introduced  by  the  AMR  algorithm.  One  simple  explanation  is  that 
BDF2  is  a  multi-step  method  (requires  the  solution  at  two  time-levels  in  addition  to  the  right-hand-side 
vector)  while  ARK2  is  a  single-step  multi-stage  method  (which  only  requires  the  solution  at  one  previous 
time  level  and  all  stages  are  built  directly  from  this). 


8.1.3.  Mass  conservation 

An  important  measure  of  the  quality  of  the  discretization  is  the  mass  conservation.  In  order  to  show 
that  our  non-conforming  AMR  implementation  does  conserve  mass  as  well  as  a  conforming  DG  method,  we 
investigate  the  mass  conservation  error.  We  define  the  mass  conservation  error  as: 


M(t) 
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Figure  14:  Time  history  of  GMRES  iterations  (k)  for  reference  (top  figure  on  each  panel)  and  0t  =  0.001  (bottom  figure  on  each 
panel)  simulations  for  (a)  BDF2  and  (c)  ARK2  simulations.  Panel  (b)  contains  the  close-up  of  the  time  history  of  k  between 
t  =  480s  and  500s  for  the  BDF2  simulation  (top  figure)  and  the  time  history  of  the  element  count  ( Ne  on  the  bottom  figure). 


where  m[t)  =  fn  p(t)dfl  is  the  total  mass  in  the  system  at  a  time  t.  The  mass  conservation  error  is  therefore 
a  normalized  measure  of  the  mass  loss  in  the  system  over  the  simulation  time.  The  plots  of  M  for  the 
reference  simulation  and  two  different  AMR  simulations  are  presented  in  Figure  15.  The  top  three  panels 
present  the  mass  conservation  error  as  a  function  of  time  for  the  reference  simulation  (Fig.  15a)  and  two 
AMR  simulations  with  different  refinement  thresholds  (Fig.  15b  and  15c).  The  M  profile  of  the  reference 
simulation,  after  the  initial  transient  behavior,  is  flat  for  all  the  time  integration  methods  and  stays  below 
10~13.  The  profiles  for  both  AMR  simulations  are  bounded  from  above  by  the  reference  mass  conservation 
level.  We  notice  also  that  M  varies  slightly  over  time  for  the  AMR  simulations.  In  panel  (d)  we  plot  the 
element  count  ( Ne )  for  the  reference  simulation  and  two  AMR  simulations.  Note  that  the  level  of  mass 
conservation  on  panels  (a),  (b)  and  (c)  is  correlated  with  the  element  count  for  reference,  0t  =  0.001  and 
6t  =  4.0  simulations.  The  mass  conservation  error  and  the  element  count  are  both  constant  for  the  reference 
simulation.  For  the  AMR  simulations  M  is  growing  whenever  there  is  an  increase  in  Ne.  This  can  be 
explained  by  the  fact  that  more  integration  points  mean  more  roundoff  errors  in  mass  computations,  which 
add  up  to  create  a  larger  mass  conservation  error  although  it  remains  far  below  the  conservation  error  of 
the  high-resolution  uniform  simulation. 

8.1. 4-  Optimization  remarks 

The  NUMA  code  is  written  in  the  Fortran90  language,  which  relies  heavily  on  the  compiler.  All  presented 
analysis  was  conducted  with  compiler  optimization  turned  off  in  order  to  investigate  the  algorithm  itself. 
Figure  16a  presents  the  decrease  in  runtime  due  to  using  03  (Intel)  compiler  optimization  setting  for  both 
explicit  and  IMEX  simulations.  Use  of  optimization  provides  almost  an  order  of  magnitude  faster  code, 
which  is  a  considerable  benefit  for  any  application.  Figure  16b  shows  the  speed-up  analysis  for  the  adaptive 
simulations  using  compiler  optimization.  Both  explicit  and  IMEX  speed-up  curves  (dashed  lines)  are  above 
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Figure  15:  Mass  conservation  error  for  Case  1  for  (a)  reference  simulation,  (b)  0t  =  0.001  AMR  simulation  and  (c)  0t  =  4.0 
AMR  simulation  for  three  time  integration  methods.  Panel  (d)  shows  the  time  history  of  the  element  count  for  all  three 
simulations  using  all  three  methods.  Results  from  different  simulations  with  the  same  time-integration  method  overlap,  which 
is  expected. 


Table  II:  Timings  (in  seconds  of  runtime)  and  percentage  breakdown  of  different  AMR  components  for  Case  1. 


RK35 

et  =  o.ooi  et  =  4.o 

BDF2 

8t  =  0.001  8t  =  4.0 

ARK2 

et  =  0.001  Bt  =  4.0 

total 

5292  (100%) 

2128  (100%) 

643.6  (100%) 

297.0  (100%) 

967.8  (100%) 

403.1  (100%) 

volume  integrals 

2876.1  (54.3%) 

1183.3  (55.6%) 

274.46  (42.6%) 

135.2  (45.5%) 

362.9  (37.5%) 

153.7  (38.1%) 

face  integrals 

507.67  (9.59%) 

182.19  (8.56%) 

66.59  (10.3%) 

28.07  (9.45  %) 

80.1  (8.29%) 

29.58  (7.34%) 

non-conforming  faces 

116.79  (2.21%) 

90.94  (4.27%) 

21.33  (3.31%) 

20.98  (7.06%) 

23.65  (2.44%) 

20.14  (4.99%) 

AMR: 

6.030  (0.11%) 

2.933  (0.14%) 

6.291  (0.98%) 

3.154  (1.06%) 

6.025  (0.62%) 

3.037  (0.75%) 

criterion  evaluation 

3.463 

1.422 

3.517 

1.468 

3.527 

1.506 

mesh  manipulation 

0.051 

0.049 

0.05 

0.057 

0.051 

0.063 

data  projection 

0.965 

0.581 

1.247 

0.77 

0.965 

0.599 

other 

1.412 

0.787 

1.450 

0.833 

1.454 

0.843 

the  theoretical  ideal  speed-up  line.  The  IMEX  time  integrators  benefitted  more  from  optimization,  as  they 
provide  higher  speed-ups  than  the  explicit  method  (even  though  it  still  suffers  from  a  variable  number  of 
GMRES  iterations  in  the  BDF2  case) .  The  biggest  beneficiary  (speed-up  wise)  of  the  compiler  optimization 
is  the  ARK2  method.  Also,  the  slope  of  all  optimized  curves  is  higher  than  the  ideal  slope.  This  indi¬ 
cates  that  the  original  assumption  about  the  work  load  being  proportional  to  the  number  of  elements  does 
not  necessarily  hold.  The  theoretical  performance  model  of  the  algorithm  does  not  incorporate  possible 
optimizations  that  occur  at  the  compiler  level.  We  attribute  those  super-speed-ups  to  the  optimization  of 
memory  accesses  (e.g.,  prefetching,  etc.),  which  plays  a  significant  role  when  the  problem  size  gets  smaller 
(and  fits  in  cache)  due  to  the  use  of  the  AMR  algorithm. 

8.1.5.  AMR  cost  breakdown 

In  Table  II  we  present  the  absolute  runtime  (in  seconds)  and  a  percentage  share  in  the  total  simulation 
time  of  selected  parts  of  the  code  for  three  time  integration  methods  and  two  refinement  thresholds.  We 
list  the  total  runtime,  the  cost  of  evaluating  the  volume  integrals  (the  fle  integrals  in  Eq.  (10)  apart  from 
the  one  containing  the  time  derivative),  conforming  and  non-conforming  face  integrals  (both  contribute  to 
the  Te  integral  in  Eq.  (10))  and  the  time  spent  in  the  AMR  subroutine.  We  further  provide  a  breakdown 
of  how  much  time  was  spent  doing  particular  AMR  tasks  like  the  criterion  evaluation,  mesh  manipulation 
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Figure  16:  (a)  Wall  clock  time  for  explicit  (blue  circles),  IMEX  BDF2  (green  squares)  and  IMEX  ARK2  (red  crosses)  simulations 
with  (dashed  line)  and  without  (solid  line)  compiler  optimization;  (b)  Speed-up  of  explicit  (blue  circles),  IMEX  BDF2  (green 
squares)  and  IMEX  ARK2  (red  crosses)  AMR  simulations  with  (dashed  lines)  and  without  (solid  lines)  compiler  optimization. 


or  data  projection.  The  category  ’’other”  contains  the  operations  like  recomputing  quantities  such  as  the 
mass  matrix,  gravity  vector  etc.  which  were  not  optimized  for  a  changing  mesh.  Instead  of  recomputing 
those  values  for  new  elements  only,  we  do  it  for  all  the  elements  in  the  new  mesh.  The  overall  time  spent 
in  those  subroutines  is  negligible  compared  to  the  total  simulation  time,  therefore  the  optimization  would 
yield  a  very  small  gain.  It  would  have  to  be  addressed  though  in  the  future. 

We  observe  that  the  total  share  of  the  AMR  in  runtime  is  on  the  order  of  1%  for  IMEX  methods  and  a 
factor  of  10  less  for  the  explicit  time  integration.  All  the  timings  for  the  AMR  part  are  basically  identical 
for  all  three  methods,  which  is  expected  as  we  call  the  AMR  subroutine  every  Is  of  the  simulation  time, 
regardless  of  the  time-step.  We  also  note  that  the  biggest  part  of  the  AMR  time  share  is  spent  in  evaluating 
the  refinement  criterion,  with  the  data  projection  cost  being  significantly  smaller  and  the  mesh  manipulation 
negligible.  We  disregard  the  cost  of  other  operations,  as  explained  above. 

For  the  sake  of  argument  one  can  say  that  if  we  checked  the  refinement  criterion  every  time  step,  the 
cost  of  AMR  would  rise  from  1%  to  10%.  That  would  be  true  if  the  simulation  required  mesh  modifications 
every  time  step.  As  shown  in  Fig.  12,  this  is  not  the  case  here,  as  even  with  refinement  check  every  10  (or 
100  for  RK35)  time-steps  we  reproduce  the  same  mesh  behavior  as  for  refinement  every  time-step.  Even 
then,  the  10%  cost  is  still  an  acceptable  one. 

The  evaluation  of  the  flux  over  the  non-conforming  faces  can  be  also  considered  an  overhead  of  AMR, 
however  it  is  more  difficult  to  quantify  how  much  of  the  cost  can  be  actually  attributed  to  AMR.  As  explained 
in  Sec.  6,  the  cost  of  evaluating  the  non-conforming  flux  does  not  only  include  computing  the  Rusanov  flux 
on  two  children  faces,  as  in  a  regular  conforming  case,  but  also  the  projections  from  and  to  the  parent  face. 
The  cost  of  the  operations  on  the  non-conforming  face  is  then  larger  than  the  cost  for  the  two  conforming 
faces  replaced  by  it.  On  the  other  hand,  the  non  conforming  face  allows  for  decreasing  the  number  of 
elements  in  the  domain,  and  therefore  limits  the  number  of  expensive  volume  integral  evaluations. 

8.2.  Case  2  -  Rising  thermal  bubble 

In  a  similar  fashion  to  Case  1,  we  start  the  rising  thermal  bubble  simulation  by  refining  the  level  0  mesh 
of  2  x  2  elements  uniformly  by  5  levels.  On  the  resulting  mesh  of  64  x  64  elements  we  generate  the  initial 
condition  described  in  Sec.  7.2.  The  polynomial  order  is  again  set  to  Np  =  5,  which  gives  the  effective 
resolution  of  3.125m.  The  smaller  domain  size  and  increased  resolution,  compared  to  Case  1,  imposes  a 
decreased  time  step.  We  choose  At  =  0.025s,  which  results  in  the  Courant  number  of  4.7.  This  set-up 
generates  a  problem  of  a  similar  size  to  Case  1,  but  with  significantly  increased  runtime  due  to  the  smaller 
time-step.  Also,  an  increased  Courant  number  will  put  more  effort  on  the  GMRES  solver,  as  the  iteration 
count  will  be  significantly  larger. 
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Figure  17:  History  of  element  count  for  different  refinement  threshold  settings  for  Case  2. 


Table  III:  Maximum  potential  temperature  perturbation  ( Omax ),  height  of  the  bubble  ( H 5)  and  simulation  runtime  ( Ts ). 


9t 

lib  [m] 

RK35 

®max  [K] 

Ts  [h] 

Hb  [m] 

BDF2 

@max  [K] 

Ts  [h] 

Hb  [m] 

ARK2 

@max  [K] 

Ts  [h] 

ref. 

961.09 

0.46110 

197.51 

960.16 

0.46124 

91.29 

961.27 

0.46064 

77.38 

0.001 

961.07 

0.46110 

58.51 

960.75 

0.46012 

28.50 

961.74 

0.45993 

23.64 

0.01 

961.05 

0.46116 

49.67 

960.88 

0.46031 

25.79 

961.81 

0.46001 

19.99 

0.1 

962.27 

0.46049 

35.48 

962.15 

0.45861 

19.43 

963.02 

0.45853 

14.98 

0.35 

968.92 

0.45795 

18.13 

967.76 

0.45551 

10.22 

968.34 

0.45759 

7.95 

Figure  17  shows  the  time  history  of  the  element  count  for  four  different  refinement  thresholds.  The 
behavior  of  the  mesh  is  similar  to  Case  1;  the  mesh  initially  does  not  change  much.  The  number  of  elements 
starts  growing  at  later  times.  For  high  thresholds  this  growth  is  subsequently  stopped  ( 9t  =  0.3  for  t  >  600s). 
The  period  of  the  initial  inactivity  is  longer  by  100s  than  for  Case  1,  and  has  a  much  larger  share  in  the 
total  simulation  time  (over  50%). 

8.2.1.  Accuracy  analysis 

Figure  18  compares  four  potential  temperature  perturbation  fields  at  t  =  700s  for  different  refinement 
thresholds.  Figure  18a  shows  that  the  lowest  threshold  refines  a  large  portion  of  the  domain  around  the 
bubble,  including  the  interior  of  the  mushroom  bubble  and  the  wake.  This  results  in  very  smooth  contour 
lines.  As  the  threshold  rises,  a  smaller  portion  of  the  domain  is  refined  and  the  contours  become  more  wavy 
indicating  some  instability.  Note  that  in  Figs.  18b  and  18c  we  see  that  the  outer  contours  of  the  mushroom 
are  refined  in  the  same  way,  but  the  mesh  within  the  bubble  is  very  different,  which  results  in  a  more  wavy 
mushroom  pattern  in  Fig.  18c.  This  indicates  that  not  only  the  strong  gradient  zone,  but  also  the  internal 
mushroom  region  is  important  for  this  case.  The  solution  in  Fig.  18d  is  the  worst  case  scenario  where  the 
general  bubble  shape  is  still  recreated,  but  the  solution  is  not  smooth  and  some  mesh  imprinting  can  be 
seen  at  the  interfaces  of  big  elements  in  the  center  of  the  mushroom.  In  this  case  only  the  highest  potential 
temperature  perturbation  areas  are  refined  and  the  lack  of  refinement  in  the  rest  of  the  bubble  affects  the 
solution  in  the  refined  region  significantly. 

Table  III  summarizes  the  maximum  potential  temperature  perturbation  in  the  domain  at  time  T  =  700s, 
as  well  as  the  position  (height)  of  the  bubble  at  that  time.  The  position  of  the  bubble  is  defined  as  a 
vertical  coordinate  of  the  top-most  cross-section  of  9  =  0.1K  isoline  and  x  =  500nr  central  axis  of  the 
bubble.  Additionally  the  runtime  for  all  the  cases  using  different  time  integration  methods  is  presented  for 
comparison.  The  runtime  was  measured  for  simulations  without  compiler  optimizations  and  preconditioning. 

For  all  time  integration  methods,  both  the  bubble  height  and  9max  follow  the  reference  solution  closely 
for  low  9t  and  diverge  for  larger  values  of  the  refinement  threshold.  The  simulation  with  9t  =  0.01  matches 
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Figure  18:  Potential  temperature  perturbation  contours  at  t  =  700s  for  (a)  9t  =  0.001,  (b)  9t  =  0.01,  (c)  9t  =  0.1,  and  (d) 
9t  =  0.35. 


the  reference  solution  almost  perfectly,  but  obtains  this  result  in  a  significantly  shorter  simulation  time 
(  25%).  In  this  case  IMEX  methods  run  with  a  significantly  higher  Courant  number  than  in  Case  1,  which 
causes  the  reference  simulations  for  both  BDF2  and  ARK2  to  differ  slightly  from  the  explicit  one.  Similarly, 
the  AMR  results  for  both  ARK2  and  BDF2  are  not  exactly  the  same  as  their  explicit  counterparts.  The 
ARK2  method  proves  to  be  the  fastest  among  the  three. 

In  Fig.  19  the  L2  norms  of  the  potential  temperature  perturbation  for  two  IMEX  methods  are  presented. 
The  norms  were  computed  using  the  reference  simulation  for  each  method  (BDF2  AMR  simulations  com¬ 
pared  with  BDF2  reference,  likewise  for  ARK2).  Similarly  to  the  Case  1  results,  the  error  does  not  differ 
much  among  methods,  which  indicates  that  AMR  affects  the  accuracy  of  both  methods  in  the  same  way.  The 
level  of  error  is  significantly  higher,  though,  than  in  the  density  current  case.  This  is  because  the  artificial 
viscosity  applied  to  stabilize  this  flow  is  significantly  smaller  than  for  the  density  current  case.  Typically 
the  viscosity  for  Case  1  is  p  =  75m2/s  while  for  Case  2  it  is  fi  =  0.1m2 /s.  This  amount  of  viscosity  is 
enough  to  stabilize  the  flow,  but  is  barely  enough  to  guarantee  a  smooth  solution  for  the  effective  resolution 
of  Ax  =  3.125m  (see  [32]).  A  more  in-depth  analysis  of  the  rising  thermal  bubble  case  with  AMR  will  follow 
in  an  upcoming  paper. 

8.2.2.  Performance  analysis 

Figure  20  shows  the  simulation  runtime  (Fig.  20a)  and  speed-up  (Fig.  20b)  of  AMR  simulations  for 
different  refinement  thresholds.  Solid  lines  represent  unoptimized  simulations,  while  dashed  lines  show  the 
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Figure  19:  L2  error  norms  for  RK35  (blue  line  with  circles),  BDF2  (green  line  with  squares)  and  ARK2  (red  line  with  crosses) 
for  Case  2. 

(a)  (b) 


element  count 


Figure  20:  (a)  Wall  clock  time  as  a  function  of  average  element  count  and  (b)  speed-up  of  AMR  simulations  as  a  function  of 
ER  for  different  IMEX  methods  for  Case  2.  Green  line  with  squares  represents  BDF2;  red  line  with  crosses  shows  the  result  for 
ARK2;  blue  line  with  circles  marks  the  timing  of  the  explicit  RK35  method.  Dashed  lines  represent  the  code  optimized  using 
compiler  flags,  solid  lines  show  the  unoptimized  case.  The  solid  black  line  shows  the  expected  ideal  speed-up. 


performance  of  the  compiler  03  optimizations.  As  observed  already  in  Table  III,  ARK2  was  the  fastest 
one  for  this  case  for  both  optimized  and  unoptimized  runs.  The  speed-up  plot  for  unoptimized  simulations 
looks  similar  to  those  for  Case  1,  where  explicit  RK35  obtains  nearly  ideal  speed-up,  followed  by  ARK2  and 
BDF2.  This  time  the  difference  between  IMEX  methods  is  not  as  pronounced.  The  compiler  optimized  runs 
again  show  speed-ups  exceeding  expectations. 

The  greater  speed  of  ARK2  simulations  can  be  credited  to  an  increased  Courant  number  (compared 
to  Case  1)  and  the  absence  of  preconditioning.  ARK2  is  a  two  stage  method,  where  each  of  the  stages 
performs  significantly  less  GMRES  iterations  than  a  single  BDF2  time-step.  The  total  number  of  iterations 
per  time  step  is  presented  in  Fig.  21.  Since  the  cost  of  GMRES  is  0{k 2)  (where  k  is  the  number  of  GMRES 
iterations)  that  alone  accounts  for  an  increased  runtime.  On  the  other  hand,  the  use  of  preconditioners 
could  significantly  bring  down  the  number  of  iterations  for  BDF2,  while  ARK2  may  not  benefit  as  much 
due  to  the  relatively  small  iteration  count  at  each  stage.  We  discuss  the  impact  of  preconditioners  in  the 
following  section. 

Figure  21  shows  the  average  number  of  GMRES  iterations  for  both  IMEX  methods.  For  ARK2  we 
present  the  sum  of  iterations  from  both  implicit  stages.  In  both  cases  AMR  causes  the  increase  of  the 
iteration  count,  however  it  is  much  more  pronounced  for  the  BDF2  method.  Contrary  to  the  Case  1  result, 
this  does  not  translate  into  a  significant  difference  in  the  speed-up  plots. 

Looking  at  Fig.  22  we  see  that  the  previously  reported  spikes  in  the  GMRES  iteration  count  (for  Case 
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Figure  21:  Average  number  of  GMRES  iterations  as  a  function  of  ER  for  different  IMEX  methods  for  Case  2.  Green  line  with 
squares  represents  BDF2;  red  line  with  crosses  shows  the  result  for  ARK2. 

1)  are  not  as  significant  for  Case  2.  The  iteration  count  for  the  BDF2  AMR  simulation  (bottom  plot  of  Fig. 
22a)  is  larger  compared  to  the  reference  case  (top  plot).  The  spikes,  however,  mainly  occur  while  de-refining 
the  mesh  (Fig.  22b)  and  do  not  significantly  contribute  to  the  overall  average  iteration  count.  Very  small 
spikes  can  be  noted  when  the  element  count  increases,  but  compared  to  the  average  iteration  number  this 
increase  is  negligible.  For  ARK2  the  AMR  GMRES  iteration  time  history  (bottom  plot  of  Fig.  22c)  much 
more  closely  follows  the  reference  case  (top  plot  of  Fig.  22c),  which  indicates  more  robustness  of  this  method 
with  respect  to  the  AMR  algorithm. 

8.2.3.  Preconditioning 

In  any  real  application  IMEX  methods  are  used  in  combination  with  a  preconditioner  in  order  to  im¬ 
prove  the  conditioning  of  the  system  and  bring  the  GMRES  iteration  count  down.  Of  course,  the  cost  of 
computing  and  applying  the  preconditioner  has  to  be  smaller  than  the  cost  saved  by  decreasing  the  iteration 
count.  An  added  complexity  occurs  in  the  AMR  case,  because  the  mesh  keeps  changing  and  therefore  the 
preconditioner  has  to  adapt  to  those  changes.  Here  we  present  only  a  very  preliminary  discussion  of  the  use 
of  preconditioners  in  the  AMR  simulations.  We  apply  the  element-based  spectrally-optimized  approximate 
inverse  preconditioner  described  in  [41]. 

Figure  23  presents  the  GMRES  iteration  count  for  both  IMEX  methods  with  preconditioning  (dashed 
line)  and  without  (solid  line).  For  both  BDF2  and  ARK2,  the  preconditioner  brought  down  the  number  of 
iterations  by  a  similar  factor.  The  decrease  of  the  iteration  count  does  not  translate  to  decrease  in  runtime 
for  both  methods  equally  well,  which  can  be  observed  in  Fig.  24a.  For  BDF2  the  use  of  preconditioning 
resulted  in  noticeably  faster  simulations,  while  for  ARK2  the  speed-up  was  not  as  pronounced.  Interestingly, 
the  reference  simulations  benehtted  more  from  the  preconditioner  (the  rightmost  points  on  the  graph).  This 
is  due  to  the  fact  that  without  AMR  the  preconditioner  has  to  be  computed  only  once  at  the  beginning  of 
the  simulation  and  does  not  need  to  be  recomputed.  For  AMR  simulations  the  recomputing  of  the  precon¬ 
ditioner  adds  a  significant  overhead.  It  is  apparent  in  Fig.  24b,  where  the  speed-up  lines  for  preconditioned 
simulations  are  significantly  lower  than  the  speed-up  for  non-preconditioned  cases. 

It  is  worth  noting  that  this  is  just  a  preliminary  study  of  the  influence  of  preconditioning  on  the  AMR 
simulations.  At  the  moment  we  recompute  the  preconditioner  each  time  the  mesh  is  modified,  even  if  only 
one  of  the  elements  is  changed.  Further  study  will  reveal  if  it  is  possible  to  limit  this  overhead.  Additionally, 
this  preconditioner  was  designed  for  the  Schur-form  CG  method  and  is  applied  to  no-Schur  form  DG  method, 
therefore  it  may  not  be  as  beneficial  for  the  cases  at  hand.  We  will  address  these  issues  in  an  upcoming 
paper. 

8.2.4-  Mass  conservation 

Similarly  to  Case  1,  the  mass  conservation  error  for  the  AMR  simulations  for  Case  2  is  bounded  by  the 
error  for  the  reference  case.  Figure  25  shows  the  same  sensitivity  of  M  to  an  increasing  number  of  elements 
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Figure  22:  Time  history  of  GMRES  iterations  (k)  for  reference  (top  figure  on  each  panel)  and  Ot  =  0.1  (bottom  figure  on  each 
panel)  simulations  for  (a)  BDF2  and  (c)  ARK2  simulation.  Panel  (b)  contains  the  close-up  of  the  time  history  of  k  between 
t  =  280s  and  300s  for  BDF2  simulation  (top  figure)  and  the  time  history  of  the  element  count  (Ne  on  the  bottom  figure). 


Table  IV:  Corrected  coefficients  for  the  RK35  method.  *  marks  the  difference  from  [42]. 


stage 

a 

P 

1 

i 

0 

0 

0.377268915331368 

2 

0 

1 

0 

0.377268915331368 

3 

0.355909775063326* 

0.644090224936674 

0 

0.242995220537396 

4 

0.367933791638137 

0.632066208361863 

0 

0.238458932846290 

5 

0 

0.762406163401431 

0.237593836598569 

0.237593836598569 

over  simulation  time.  All  three  time  integration  methods  give  consistent  mass  conservation  behavior.  During 
the  study  an  important  feature  of  the  explicit  RK35  method  has  been  noted,  which  can  be  extended  to  other 
time  integration  methods.  It  is  important  that  the  coefficients  of  each  stage  of  the  Runge-Kutta  method  are 
consistent  up  to  a  desired  precision.  If  we  aim  for  double  precision  and  therefore  are  looking  for  round-off 
errors  at  the  10-16  level,  the  coefficients  have  to  be  consistent  up  to  the  16t,i  decimal  place.  This  is  not 
the  case  with  the  coefficients  of  the  third  stage  of  RK35  published  in  Ruuth  [42] .  In  the  case  of  this  study 
a  small  inconsistency  at  the  last  decimal  place,  which  is  probably  the  result  of  a  roundoff  error,  led  to  a 
steady  growth  of  the  M  norm.  In  Table  IV  we  provide  consistent  coefficients  for  the  RK35  method  which 
fixed  this  steady  growth. 

8.2.5.  AMR  cost  breakdown 

Compared  to  the  results  for  Case  1,  the  share  of  the  cost  of  AMR  for  IMEX  methods  in  Case  2  is 
significantly  smaller.  This  is  because  the  time  integration  became  more  expensive  due  to  the  increased 
Courant  number.  Also,  the  criterion  evaluation  cost  with  regards  to  the  total  AMR  cost  is  much  bigger  than 
in  the  previous  case,  especially  for  6t  =  0.01  simulations.  This  is  because  we  are  checking  the  refinement 
criterion  every  10  iterations,  which  translates  to  0.25s  of  simulation  time.  We  therefore  check  for  refinement 
more  often  than  actually  perform  any  mesh  modification.  It  is  clear  that  the  refinement  check  frequency 
parameter  could  have  been  chosen  more  optimally.  Note,  however,  that  for  this  case  the  cost  of  time 
integration  is  so  high  that  even  more  frequent  checks  would  not  cause  the  cost  of  AMR  to  exceed  a  few 
percents. 
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Figure  23:  Average  number  of  GMRES  iterations  as  a  function  of  ER  for  different  IMEX  methods  for  Case  2  with  (dashed 
line)  and  without  preconditioning  (solid  line).  Green  line  with  squares  represents  BDF2;  red  line  with  crosses  shows  the  result 
for  ARK2. 


(a) 


(b) 


Figure  24:  (a)  Runtime  and  (b)  speed-up  of  compiler  optimized  IMEX  simulations  for  Case  2  with  (dashed  line)  and  without 
preconditioning  (solid  line).  Green  line  with  squares  represents  BDF2;  red  line  with  crosses  shows  the  result  for  ARK2. 


9.  Conclusion 

We  presented  the  details  of  the  AMR  algorithm  for  the  discontinuous  Galerkin  version  of  the  Non¬ 
hydrostatic  Unified  Model  of  the  Atmosphere  (NUMA)  and  sought  to  analyze  the  benefit  of  AMR.  To  this 
end,  we  compared  the  accuracy  versus  the  speed-up  of  AMR  with  respect  to  uniform  grid  simulations. 
The  AMR  algorithm  was  tested  using  a  threshold  based  criterion  for  both  the  density  current  and  rising 
thermal  bubble  cases.  We  have  investigated  the  performance  of  the  AMR  algorithm  running  with  three  time 
integration  methods:  explicit  RK35  and  two  IMEX  methods:  BDF2  and  ARK2.  For  all  the  methods  the 
accuracy  of  the  result  depends  on  the  choice  of  the  threshold  for  the  refinement  criterion.  For  higher  element 
ratios  (ER)  the  criterion  does  not  capture  all  the  features  of  the  solution,  therefore  the  L2  error  increases 
from  1CU6  for  low  threshold  simulations  to  10-4  for  high  threshold  ones  for  the  density  current  case.  The 
most  significant  features  of  the  solution  such  as  the  front  position,  minimum  potential  temperature  or  the 
general  structure  of  the  front  are  captured  correctly  even  for  high  ER  simulations.  The  ARK2  method 
reproduces  closely  the  result  of  RK35,  while  BDF2  provides  a  less  similar  agreement. 

For  the  rising  thermal  bubble  case  the  general  features  of  the  solution  are  captured  correctly  by  all  the 
simulations  except  the  one  with  the  highest  threshold  value.  In  that  case  the  boundary  of  the  bubble  is 
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(c) 


(d) 


Figure  25:  Mass  conservation  error  for  Case  2  for  (a)  reference  simulation,  (b)  6t  =  0.01  AMR  simulation  and  (c)  0t  =  0.35 
AMR  simulation  for  three  time  integration  methods.  Panel  (d)  shows  the  time  history  of  the  element  count  for  all  three 
simulations  using  all  three  methods.  Results  from  different  simulations  with  the  same  time-integration  method  overlap,  which 
is  expected. 


Table  V:  Timings  (in  seconds  of  runtime)  and  percentage  breakdown  of  different  AMR  components  for  Case  2 


RK35 

6t  =  0.001  0t  =  4.0 

BDF2 

<9,  =  0.001  et  =  4.0 

ARK2 

6t  =  0.001  6t  =  4.0 

total 

27994  (100%) 

9710  (100%) 

18938  (100%) 

7154  (100%) 

13245  (100%) 

4687  (100%) 

volume  integrals 

13954  (49.8%) 

4740  (48.8%) 

4990  (26.3%) 

1829  (25.6%) 

4618  (34.9%) 

1655  (35.3%) 

face  integrals 

2560  (9.14%) 

678.9  (7%) 

1540  (8.13%) 

385.7  (5.39%) 

1369  (10.3%) 

342.6  (7.31%) 

non-conforming  faces 

677.7  (2.42%) 

503.6  (5.19%) 

535.4  (2.83%) 

440.6  (6.16%) 

460.3  (3.47%) 

374.4  (7.99%) 

AMR: 

21.7  (0.08%) 

8.5  (0.09%) 

22.2  (0.12  %) 

8.49  (0.11%) 

21.8  (0.16%) 

8.29  (0.18%) 

criterion  evaluation 

16.8 

5.9 

16.8 

5.9 

16.8 

5.93 

mesh  manipulation 

0.15 

0.16 

0.15 

0.16 

0.15 

0.15 

data  projection 

1.76 

0.86 

2.39 

1.13 

1.93 

0.9 

other 

2.5 

1.11 

2.7 

1.16 

2.81 

1.22 

clearly  disturbed  and  some  mesh  artifacts  are  present  in  the  solution.  It  was  discovered  that  not  only  the 
refinement  in  the  high  temperature  gradient  area  at  the  boundary  of  the  bubble  needs  to  be  resolved,  but 
the  resolution  of  the  interior  of  the  bubble  plays  a  role.  All  the  important  features  such  as  the  position 
of  the  bubble  and  maximum  potential  temperature  were  captured  very  accurately  by  the  lower  refinement 
thresholds  at  a  fraction  of  the  cost. 

The  performance  of  the  algorithm  was  investigated  by  comparing  the  runtime  and  speed-up  of  differ¬ 
ent  AMR  simulations.  The  explicit  RK35  showed  nearly  perfect  speed-up  which  confirms  that  the  AMR 
algorithm  is  indeed  very  efficient.  In  the  cases  shown  in  the  paper  the  overall  cost  of  the  adaptive  mesh 
algorithm  is  below  1%  of  the  total  runtime.  The  main  component  of  the  AMR  overhead  is  the  evaluation 
of  the  refinement  criterion,  while  the  mesh  manipulation  and  data  projections  have  negligible  cost.  For 
the  density  current  test  case  (Case  1)  with  BDF2  an  additional  big  overhead  was  the  increased  GMRES 
iteration  count  caused  by  AMR.  The  BDF2  speed-up  turned  out  to  be  the  worst  of  all  the  three  methods. 
Both  IMEX  methods  had  runtimes  about  an  order  of  magnitude  smaller  than  RK35.  ARK2  was  shown  to 
be  slightly  slower  than  BDF2  using  the  same  time  step,  but  with  much  increased  accuracy  and  improved 
speed-up  properties.  The  ARK2  method  was  shown  to  be  more  robust  (less  sensitive)  to  the  changing  mesh 
caused  by  the  AMR  algorithm.  For  the  rising  thermal  bubble  test  case  (case  2),  the  ARK2  was  the  fastest 
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method  due  to  the  increased  Courant  number.  The  difference  in  speed-up  curves  was  not  as  pronounced, 
but  again  the  ARK2  proved  to  have  better  properties  and  less  overhead  than  BDF2. 

In  nonhydrostatic  atmospheric  applications  (when  using  the  fully  compressible  equations),  IMEX  meth¬ 
ods  are  an  absolute  necessity,  due  to  a  very  strict  constraint  imposed  on  the  explicit  time-step  by  acoustic 
waves.  This  study  shows  that  the  performance  of  some  IMEX  methods  can  be  affected  by  a  dynamically 
adaptive  mesh.  It  is  important  not  only  to  design  the  AMR  algorithm  in  an  efficient  way,  but  also  to  consider 
which  time  integration  method  to  use  in  such  applications.  In  our  future  investigations  we  will  pursue  the 
ARK2  method  due  to  its  good  performance  properties  and  excellent  accuracy. 

The  mass  conservation  error  for  both  test  cases  and  all  the  time  integration  methods  was  shown  to  be 
bounded  by  the  mass  conservation  error  of  the  reference  simulations.  For  adaptive  simulations  the  mass 
conservation  was  affected  by  the  changing  number  of  elements  in  the  mesh. 

A  preliminary  study  of  the  influence  of  preconditioning  on  the  AMR  simulations  showed  that  it  is 
another  significant  source  of  overhead,  mainly  because  the  preconditioner  was  recomputed  after  every  mesh 
modification.  We  will  pursue  the  possibilities  of  reducing  this  overhead  in  future  work  since  it  may  not  be 
necessary  to  always  recompute  the  preconditioner. 

Finally,  the  effect  of  the  compiler  optimizations  on  the  algorithm  performance  was  investigated.  The 
optimizations  provide  significant  runtime  reduction  of  the  simulations,  as  big  as  a  factor  of  10.  Additionally, 
optimization  affects  the  speed-up  of  the  AMR  algorithm,  as  more  efficient  memory  handling  for  smaller 
problems  allows  for  speed-ups  greater  than  it  would  follow  from  the  simple  performance  model.  In  a  sense 
the  optimizations  more  than  make  up  for  the  overhead  caused  by  the  AMR. 
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Appendix  A 

ID  projection 

The  integral  projection  method  presented  in  this  section  was  proposed  by  [29]  for  general  h-p  non- 
conforming  elements.  Here  we  describe  the  method  applied  to  a  specific  fi-non-conforming  edge  in  2:1 
balance. 
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Let  £  €  [—1, 1]  denote  the  coordinate  in  the  standard  element  space  corresponding  to  the  parent  (left) 
side  of  the  edge.  Define  z^^z^  €  [—1,1]  as  the  coordinates  of  standard  elements  corresponding  to  two 
children  (right)  sides  of  the  edge.  Let 


z 


(!)  = 


g  -  0w 

s 


z&  = 


s  -  qW 

s 


(21) 


be  a  map  £  — ►  z^k\k  =  1,2  from  the  parent  space  to  children  spaces,  where  o ^  is  the  offset  parameter 
for  the  child  k  and  s  is  the  scale  parameter.  In  our  case  o*-1-*  =  —0.5,  o ^  =  0.5  and  s  =  0.5.  The  inverse 
mapping  z ^  — »  £  is 

£  =  s-*(fc)+ o(fc),  As  =  1,2.  (22) 

We  can  now  expand  the  variables  using  a  polynomial  basis  as  follows: 


qL(0 


qLk{z{k) ) 
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3  =  0 
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3=0 


k  =  1,2. 


By  substituting  (22)  into  (23)  we  get 


qL(z{k)) 


^^•(s-^+oW), 


3=0 


k  =  1,2. 


(23) 

(24) 


(25) 


In  order  to  perform  projection  from  a  parent  side  to  two  children  sides  of  the  edge  we  require  that  for 
both  children  sides 


l 


-l 


Substitution  of  (24)  and  (25)  to  (26)  and  rearranging  yields 


Lk 


)i/>i(z{k))dzW 


(27) 


Since  z ^  £  [—1,1]  regardless  of  k,  we  can  write  2  =  z^k\  which  simplifies  the  notation.  The  terms  in 
brackets  can  be  represented  in  matrix  form  as 

l  l 

M ij  =  J 'il>i(z)'il)j(z)dz,  S =  J  il>i(z)i/>j(s  •  z  +  o^)dz,  k  =  1,2,  (28) 

-l  -l 


which  simplifies  equation  (27)  to 


M,j9“-S^=0,  As  =  1,2. 


>-i3  Hj 


13  ^ 3 


(29) 


Note  that  M,y  is  the  standard  ID  mass  matrix,  which  is  easily  invertible.  If  P fk  =  M^S^,  then 

qkk  =  Pfkqf,  Ac  =  1,2 


(30) 


and  we  call  PSk  the  scatter  projection  matrix. 
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Similarly  the  gather  projection  from  two  children  sides  to  a  parent  side  is  performed  (see  Fig.  5b).  We 
require  that  on  the  parent  side  of  the  face 


1 

J  (qR(0-QR(0)M m  =  o, 


(31) 


where  qR  is  the  continuous  projection  of  the  variables  qR1  and  qR2  from  children  sides  to  a  parent  side,  and 


r,Rl 


qR(0  = 


( 2 (1))  =  q 


-  r,Rl  / 


(1)' 


for  —  1  <  £  <  0 
qR2(z (2))  =  qR 2  (  f— )  for  0+  <  £  <  1. 


n<2> 


(32) 


Note  that  qR(£)  allows  for  a  discontinuity  at  £  =  0.  Substituting  (32)  into  (31)  yields 


I  ( qR(Z)-qR1  +  /  ( qR(0-qR 2  =  ° 


f-o^' 


-1  0 
Using  an  expansion  analogous  to  (24),  (25)  and  rearranging  we  get 


J  MO^AO^J  qf  -  Ij/"  4>A£)i,j  °  )  qR1  -  ( 


—  o ^ 
s 


df  I  qf  =  0. 


We  introduce  the  variable  change  £  =  s  ■  z  +  o<k) ,  df  =  s  ■  dz  to  the  second  and  third  integral,  that  gives 


J i>i(€)i>AZ)dZj  dR  ■ z  +  o(k))^j(z)dzj  qRk  =  o. 


(33) 


Note  that  the  term  in  brackets  to  the  left  of  qRk  is  the  transpose  of  defined  in  (??).  We  can  write  the 
integrals  in  matrix  notation: 


Mijq?-s'Esifq?k  =  °, 


(34) 


k= 1 


where  M,y  is  the  mass  matrix  as  defined  in  (28) .Finally,  if  we  define  P fk  =  s  •  to  be  the  gather 

projection  matrix,  then 


k= 1 


(35) 


2D  projection 

The  2D  projection  presented  here  is  a  two-dimensional  extension  of  the  integral  projection  method 
presented  in  the  previous  section. 

Figure  6  presents  the  standard  element  corresponding  to  the  parent  element  with  coordinates  (£,  rf)  € 
[—1,1] 2  and  four  children,  each  with  separate  coordinates  (z[k\z^)  €  [— 1,1]2,  where  k  =  1, . . . ,  4.  We 
define  a  map: 
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where  o[k>  and  are  offset  parameters  corresponding  to  each  child  element  k  and  coordinate  Z\  and  z2- 
The  inverse  mapping  is  now 


(  =  s  ■  z 
r]  =  s  ■  z 


(fc) 
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(fc) 

2 


+  O 


(fc) 

1  ) 


Tfc) 


(37) 

(38) 


where  k  =  1, . . . ,  4,  the  scale  parameter  s  =  0.5  and  offsets  =  ±0.5  depending  on  direction  i  and  element 
number  k. 

Each  element  has  a  polynomial  basis  defined  in  which  we  expand  the  projected  variable, 


<7P  (£,??) 


Mjv 


e  ±r  i’j&v), 

i= 1 


Mjv 

E 

i=i 


V’j(«ifc). 


(39) 

(40) 


where  is  the  parent  element  variable  and  qck  is  the  /c-th  child  element  variable  projected  from  qp ,  and 
Mn  is  the  number  of  nodal  points  in  the  element.  Similarly  as  in  Equation  (25)  we  can  substitute  the 
inverse  map  (37)  to  (39)  and  represent  the  parent  variable  in  terms  of  the  children  coordinate  system. 
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For  each  k  =  1, . . . ,  4  we  require  that 
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zik)]dz[k)dzik)  =  0. 


(42) 


Substituting  expansions  (39),  (40),  rearranging  and  employing  matrix  notation  yields 

Myf  -  S {k)qp  =  0,  (43) 

where 


My 
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J  J  ip j{s  ■  z\  +  o^,  s  ■  z2  ±  o^)ipi{z\,  z2)dzidz2- 
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(44) 


(45) 


The  projection  matrix  is  once  again  constructed  by  inverting  the  mass  matrix  and  left-multiplying  the  inverse 
with  Eq.  (43).  This  yields 

q?k  =  (Pf£)««f ,  fc  =  l,...,4,  (46) 

where 

pf£  =  M_1S(fc).  (47) 

Similarly  as  in  the  case  of  the  gather  projection  in  the  ID  non-conforming  edge  case,  the  2D  gather 
projection  matrix  is  constructed  by  multiplying  the  inverse  of  the  mass  matrix  and  transpose  of  2D 

Pg  =s-M-1S(fe)T, 
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(48) 


which  yields: 


gf  =  £(p  2nhqfk-  (49) 

fc= i 

This  approach  is  easily  extendable  to  3D  projections  of  hexahedral  elements  since  the  same  tensor-product 
operations  are  being  applied  in  ID,  2D,  or  3D. 
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